The Final Report for Project Entitled:


Implementation  of  Computer  Assisted  Breast  Cancer  Diagnosis 
(US  Army  Grant  No.  DAMD17-93-J-3007) 


Army  grant  DAMD17-93-J-3007  was  initiated  in  December  1992  for  completion  in  December 
1995.  This  was  extended  to  a  completion  date  of  June  1996.  This  represents  the  final  report  for  this 
project. 


1.  Introduction 

Recently, several investigators have proposed methods for the automatic detection of
microcalcifications and masses on mammograms. Significant improvements in accuracy have been made
since the initial attempts [Chan 1987; 1988] to apply computer algorithms to the detection of
microcalcifications. We believe that it is important to implement the program on a high-speed
workstation and conduct a large-scale clinical trial in order to evaluate its clinical practicability and
limitations. Although the false-positive rate for the detection of masses is still very high, we have been
using an artificial neural network to classify malignant and benign masses. We believe that the creation
of a computer program to analyze features of suspected masses will give rise to a more useful and
fundamental approach to computer-aided diagnosis.

Because digital mammography produces a large data volume owing to its high-resolution imaging, data
compression is an important means of facilitating mammographic image transmission and storage. We
have studied the characteristics of mammograms and developed compression methods specifically for
mammograms using gray value splitting in conjunction with wavelet and full-frame discrete cosine
transform (DCT) techniques. The effects of applying data compression to the proposed computer-aided
diagnosis (CADx) scheme for the detection of microcalcifications were also tested during this reporting
period.


2.  Research  in  the  Detection  of  Microcalcifications 

2.1.  Detection  of  Suspected  Microcalcifications 

Microcalcifications in breast cancer are reported to occur as clusters of five or more
microcalcifications within a 1 cm² area [Black 1965, Fisher 1975]. When the digitization pixel size is
50 µm (using a Lumiscan 150), there are 40,000 pixels in a 1 cm² area. To have five detections or pixels
(0.0125%) possessing high intensity in the area means that one should set a threshold on pixel intensity
of approximately 3.61σ (σ: standard deviation). In one experiment, we used 3.02σ as the threshold,
corresponding to a maximum of 50 pixels (0.125%, as indicated in Figure 1), because a larger
microcalcification may contain several adjacent detected pixels. Note that a background trend correction
was applied to each image block prior to the statistical calculation. The previously detected suspected
areas (i.e., 50 pixels) were masked with the mean value in this detecting procedure. This procedure was
performed with a 1 cm² template (200 × 200 pixels), moving 190 pixels per step and scanning through
the mammogram horizontally and then vertically.
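The scanning procedure above can be sketched as follows. This is a minimal illustration, not the program used in the project: the function names are hypothetical, the background trend correction and the masking of earlier detections are omitted, and the handling of the final window at the image edge is an assumption.

```python
from statistics import mean, pstdev

def window_origins(length, size=200, step=190):
    """Start offsets of a sliding window that covers [0, length)."""
    if length <= size:
        return [0]
    origins = list(range(0, length - size + 1, step))
    if origins[-1] != length - size:
        # Assumption: add one last window flush with the image edge.
        origins.append(length - size)
    return origins

def detect_in_block(block, k=3.02):
    """Return (row, col) of pixels brighter than mean + k * sigma of the block."""
    values = [v for row in block for v in row]
    limit = mean(values) + k * pstdev(values)
    return [(r, c)
            for r, row in enumerate(block)
            for c, v in enumerate(row)
            if v > limit]

# Number of template positions along each axis of a 4,096 x 5,120 image.
print(len(window_origins(4096)), len(window_origins(5120)))
```

With a 190-pixel step, neighboring 200 × 200 templates overlap by 10 pixels, so a small object lying on a template border is still seen whole by at least one template.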


Figure 1. Assuming the noise spectrum fits a Gaussian distribution, only 0.125% of pixels have an
intensity higher than 3.02σ.
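The relation between an allowed pixel count and the threshold can be checked numerically. The sketch below assumes a one-sided Gaussian tail, as in Figure 1, and uses a hypothetical helper name; it reproduces 3.02σ for 50 pixels per 200 × 200 block. The same calculation for five pixels gives roughly 3.66σ, in the neighborhood of the approximately 3.61σ quoted above.

```python
from statistics import NormalDist

def sigma_threshold(n_pixels, block_pixels=200 * 200):
    """Threshold, in units of sigma, exceeded on average by n_pixels of a block."""
    tail = n_pixels / block_pixels          # one-sided tail probability
    return NormalDist().inv_cdf(1.0 - tail)

for n in (5, 50):
    print(n, round(sigma_threshold(n), 2))
```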

After carefully evaluating twenty-two mammograms containing subtle microcalcifications (only
three clustered microcalcifications on three mammograms were associated with a malignant process), we
found that 3.02σ was a suitable threshold value except in radiolucent regions (OD > 2.3),
where the threshold should be set at 2.75σ, corresponding to 120 pixels (0.3%) in a 1 cm² area. In
addition, when a large area was detected (> 30 pixels), additional pixels corresponding to that area
were granted in the local operation. Our results indicated that all microcalcifications (27 clusters
confirmed by biopsy and 126 singles confirmed by an experienced radiologist) were detected
through the above procedure. However, an average of 858 suspected areas per mammogram was
obtained (i.e., a 99.5% false-positive rate for 100% true-positive detection). This procedure is equivalent
to the pre-scan process of a computer-aided diagnosis scheme for the detection of microcalcifications
[Chan 1987; 1990]. The important point here is that we have developed an effective computer program
that detected all microcalcifications in this test set. It takes 5-7 seconds on a DEC Alpha computer to
process a digital mammogram of 4,096 × 5,120 pixels. The suspected areas will be used for further
evaluation by the CADx using more stringent criteria, and for error handling in the mammographic image
compression described in the next section.


3.  Adaptive  Lossless  Mammographic  Image  Compression 

We have also developed an adaptive lossless compression scheme for mammograms by
combining a high-compression method with the detection of all suspected microcalcifications, to ensure
data accuracy in the clinically significant areas. In the previous section, we described how to detect
suspected microcalcifications. Handling 858 suspected areas is not a big task when compared to the
compression of a 4K × 5K mammogram, and it allows us to preserve the maximum data accuracy in
clinically significant areas. This type of error control should be used in any medical image compression
scheme when possible.


3.1.  Mammographic  Image  Compression  via  Wavelet  Decomposition 

Recently, we have used a wavelet transform for mammographic image compression [Daubechies
1988, Mallat 1989, Cody 1992, Antonini 1992]. Before the wavelet transform, the boundary of the
breast was outlined, and only the area within the boundary was compressed. Figure 2 shows
a typical multi-level wavelet transform and the associated compression procedure. The larger the image,
the more levels of wavelet transform can be applied. In general, “A” occupies much less computer
space than “B”, and “A” space + “B” space is about 4K × 5K × 3 bits (a compression ratio of 4:1). If the
air region is included in the compression process, the average error-free compression ratio is ≈2.5:1.


[Figure 2 block labels: "Bit allocation, quantization, and error-free coding"; "Quantization errors can be
encoded by an error-free coding".]

Figure 2. A typical wavelet decomposition and associated compression procedure for a mammogram.
(Note: only a two-level decomposition is shown.)


In this study, we decomposed each image with a 7-level wavelet transform; hence, the smallest-size
image will be a matrix of 128 × 160 pixels. The lowest-resolution subimage will be further decomposed
by an operation called differential pulse code modulation (DPCM). The entropy of all decomposed
subimages will be calculated to determine the best wavelet kernel for mammographic image
compression.
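To illustrate why DPCM and an entropy measure are paired here, the following minimal sketch applies a previous-sample predictor to one row of pixel values and compares first-order entropies. The predictor choice and the function names are assumptions for illustration; the report does not specify the DPCM predictor that was used.

```python
from collections import Counter
from math import log2

def dpcm_encode(samples):
    """Keep the first sample, then store differences from the previous sample."""
    residuals = [samples[0]]
    for prev, cur in zip(samples, samples[1:]):
        residuals.append(cur - prev)
    return residuals

def dpcm_decode(residuals):
    """Invert dpcm_encode by cumulative summation (lossless)."""
    out = [residuals[0]]
    for diff in residuals[1:]:
        out.append(out[-1] + diff)
    return out

def entropy_bits(symbols):
    """First-order entropy in bits per symbol."""
    total = len(symbols)
    counts = Counter(symbols)
    return -sum(n / total * log2(n / total) for n in counts.values())

row = [100, 101, 102, 103, 104, 105, 106, 107]   # smooth ramp of pixel values
residuals = dpcm_encode(row)
print(entropy_bits(row), entropy_bits(residuals))
```

On smooth data the residuals concentrate on a few values, so their entropy, and therefore the size of an ideal entropy code, is lower than that of the raw samples.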

3.2.  Error-Controlled  Compression  for  Digital  Mammograms 

We  believe  that  an  accurate  error-control  procedure  is  an  innovative  solution  to  make  a 
compression  scheme  clinically  useful.  A  computer  scheme  for  the  compression  was  tested  and  is 
described  as  follows: 
(a) Detect all suspected microcalcifications (clusters and singles) using the method described
in Section 2.

(b) Perform an error-free compression using DPCM and arithmetic coding on the detected areas.
Replace each detected area with the surrounding intensity using cubic spline interpolation.

(c) Perform a multi-level wavelet transform on the mammogram.

(d) Perform quantization in the wavelet domain. (The higher-level, lower-resolution
subimages should receive less destructive quantization.)

(e) Perform entropy coding on the quantized subimages to obtain file “A” indicated in Figure 2
(arithmetic coding [Witten 1987] for uncorrelated coefficients and L-Z coding [Ziv 1978] for
correlated data sequences).
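The error-control idea in steps (a) through (e) can be sketched with simple stand-ins. In the code below, zlib replaces the DPCM and arithmetic coding of step (b), a global mean fill replaces the cubic spline interpolation, and plain uniform quantization replaces the wavelet transform, quantization, and entropy coding of steps (c) through (e). All of these substitutions and all function names are assumptions made only to show that suspected areas are reconstructed exactly while the rest of the image is coded with bounded error.

```python
import zlib

def encode(image, suspects, step=8):
    """image: list of rows of ints (< 65536); suspects: list of (r0, c0, h, w)."""
    work = [row[:] for row in image]
    flat = [v for row in image for v in row]
    fill = sum(flat) // len(flat)            # stand-in for spline interpolation
    patches = []
    for r0, c0, h, w in suspects:
        raw = b"".join(image[r][c].to_bytes(2, "big")
                       for r in range(r0, r0 + h)
                       for c in range(c0, c0 + w))
        patches.append(((r0, c0, h, w), zlib.compress(raw)))  # lossless part
        for r in range(r0, r0 + h):
            for c in range(c0, c0 + w):
                work[r][c] = fill
    # Stand-in for the lossy path: uniform quantization to the nearest step.
    quantized = [[(v + step // 2) // step for v in row] for row in work]
    return quantized, patches

def decode(quantized, patches, step=8):
    image = [[q * step for q in row] for row in quantized]
    for (r0, c0, h, w), blob in patches:
        raw = zlib.decompress(blob)
        values = iter(int.from_bytes(raw[i:i + 2], "big")
                      for i in range(0, len(raw), 2))
        for r in range(r0, r0 + h):
            for c in range(c0, c0 + w):
                image[r][c] = next(values)   # restore exact pixel values
    return image
```

After a round trip, pixels inside each suspected rectangle equal the originals, and every other pixel differs from the original by at most half the quantization step.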

3.3.  Experimental  Results 

The unique point of this work is the addition of an error-free feature for the suspected disease areas
to a compression scheme. No compression artifact should be observable by an experienced breast
radiologist. One must realize that there is no need to digitize at a resolution as high as 50 µm/pixel except
in those areas containing subtle microcalcifications. However, the error-control feature reduces the
overall compression efficiency (ratio) to some degree. Equation (1) provides a formula to calculate the
effective compression ratio when the error-control feature is added to the compression system:

$$R_t = \frac{R \times R_e \times T}{(R - R_e) \times N \times S + R_e \times T} \qquad (1)$$

where T is the total number of pixels in the original mammogram, S is the number of pixels in a
suspected area for error-free coding, N denotes the number of suspected areas, R is the compression
ratio obtained by performing a transform (wavelet) coding, R_e is the average compression ratio to encode
microcalcification areas losslessly, and R_t is the total effective compression ratio.

We tested the same twenty-two mammograms as used in Section 2. We calculated the effective
compression ratio using the following values:

N ≈ 858;

S = 640 (≈ 25 × 25 pixels), which was averaged from 81% tiny suspects requiring 20 × 20 pixels
(i.e., a 1 mm × 1 mm area) and 19% medium-sized suspects requiring 40 × 40 pixels;

T = 20,971,520 (4,096 × 5,120);

R_e ≈ 2.5;

R ≈ 40:1 (estimated acceptable compression ratio), which is partly due to the fact that ≈50% of a
mammogram contains air space.

Substituting the above values into Equation (1), we obtained R_t ≈ 29, which also indicates that the
compressed data volume increased by about 40% when the error-free feature was added to the
compression scheme. Since each 12-bit datum is stored in a 16-bit computer space, R_t was 38 for
current commercial data systems. Because the suspected areas may contain significant clinical
information, we believe that the error-control feature is necessary and is a cost-effective approach to
mammography data reduction.
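The arithmetic behind these figures can be reproduced with a few lines. The sketch below writes the effective ratio as the original size divided by the sum of the lossy and lossless compressed sizes, which is algebraically the same as Equation (1); the function and variable names are chosen here for illustration.

```python
def effective_ratio(T, N, S, R, Re):
    """Total effective compression ratio with N lossless areas of S pixels each."""
    lossless_pixels = N * S
    compressed = (T - lossless_pixels) / R + lossless_pixels / Re
    return T / compressed

T = 4096 * 5120
Rt = effective_ratio(T, N=858, S=640, R=40.0, Re=2.5)
growth = 40.0 / Rt - 1.0          # extra compressed data relative to R = 40 alone
Rt_16bit = Rt * 16 / 12           # 12-bit data held in 16-bit words

print(round(Rt, 1), round(growth * 100), round(Rt_16bit, 1))
```

The three printed values correspond to the effective ratio of about 29, the roughly 40% growth in compressed data, and the ratio of about 38 for 16-bit storage.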


4.  Recognition  of  Mammographic  Microcalcifications  with  an  Artificial  Neural  Network 

4.1.  Detection  of  Clustered  Microcalcifications 

We  have  developed  a  computer-aided  diagnosis  (CADx)  program  for  automated  detection  of 
clustered  microcalcifications  in  digital  mammograms.  In  this  study,  we  investigated  the  use  of  a 
convolution  neural  network  (CNN)  in  conjunction  with  the  CADx  program  to  reduce  false-positive  (FP) 
detections. 

Screen-film  mammograms  containing  subtle  microcalcifications  were  digitized  with  a  laser  film 
scanner.  After  signal-to-noise  ratio  (SNR)  enhancement  and  background  removal  with  a  spatial  filter, 
potential  signal  sites  were  detected  with  a  locally  adaptive  gray-level  thresholding  technique.  The  size 
and  contrast  were  used  to  discriminate  false  signals  from  true  microcalcifications.  The  remaining  signals 
were  then  inspected  by  the  CNN.  Image  blocks  containing  individual  microcalcifications  in  the  SNR- 
enhanced  images  were  input  to  the  CNN.  The  CNN  consisted  of  nodes  organized  in  groups  and  the 
weights  connecting  the  nodes  were  organized  by  convolution  kernels.  These  weights  integrated 
neighborhood  information  for  recognition  of  the  true  signals.  After  training,  we  found  that  a  CNN  with 
two  hidden  layers,  both  containing  10  groups  of  nodes,  was  effective  in  the  classification  of  true  and 
false  signals.  The  output  signals  from  the  CNN  further  underwent  a  regional  clustering  algorithm  for 
detection  of  clustered  microcalcifications. 

We  found  that  the  CNN  could  classify  individual  microcalcifications  with  the  area  under  the  ROC 
curve,  Az,  of  0.88.  Free-response  ROC  (i.e.,  FROC)  analysis  showed  that  the  addition  of  CNN 
classification  to  the  CADx  program  reduced  the  false-positive  cluster  detection  by  60-70%  for  a  given 
true-positive  (TP)  rate.  After  adding  a  criterion  regarding  a  minimum  of  three  calcifications  in  one  cluster 
for  a  detection,  the  Az  was  increased  to  0.96.  These  results  indicate  that  the  CNN  can  significantly 
increase  the  accuracy  of  the  CADx  program. 

4.2. Classification of Malignant and Benign Clustered Microcalcifications

We have developed computer vision methods for classification of malignant and benign clustered
microcalcifications. Mammograms are digitized at a pixel size of 35 µm × 35 µm. The program
operates locally in regions of interest (ROIs) containing clusters of microcalcifications on the
mammograms. Morphological features characterizing the microcalcifications and texture features
characterizing textural changes in the tissue region surrounding the cluster are extracted from the ROIs.
For extraction of texture features, we first employ a distance-weighted interpolation technique to estimate
the low-frequency background of the ROI using a band of pixels around its perimeter. The spatial gray
level dependence (SGLD) matrices of the background-corrected ROI are determined at various pixel
pair distances. Thirteen texture features that characterize the ROI, such as correlation, energy, inertia,
inverse difference moment, and entropy, are calculated from the SGLD matrices.
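As an illustration of how such texture features are obtained, the sketch below builds a symmetric, normalized SGLD (gray level co-occurrence) matrix for one pixel-pair displacement and computes five of the named features. It uses common textbook definitions; the exact definitions, gray level quantization, and set of thirteen features used in this work are not given here, so the details and names below are assumptions.

```python
from math import log2

def sgld_matrix(image, dr, dc, levels):
    """Symmetric, normalized co-occurrence matrix for displacement (dr, dc)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                counts[i][j] += 1
                counts[j][i] += 1            # count each pair in both orders
    total = sum(map(sum, counts))
    return [[n / total for n in row] for row in counts]

def texture_features(p):
    """Energy, inertia, inverse difference moment, entropy, and correlation."""
    idx = range(len(p))
    mu = sum(i * p[i][j] for i in idx for j in idx)
    var = sum((i - mu) ** 2 * p[i][j] for i in idx for j in idx)
    cov = sum((i - mu) * (j - mu) * p[i][j] for i in idx for j in idx)
    return {
        "energy": sum(v * v for row in p for v in row),
        "inertia": sum((i - j) ** 2 * p[i][j] for i in idx for j in idx),
        "idm": sum(p[i][j] / (1 + (i - j) ** 2) for i in idx for j in idx),
        "entropy": -sum(v * log2(v) for row in p for v in row if v > 0),
        "correlation": cov / var if var > 0 else 0.0,
    }

roi = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 2, 2, 2],
       [2, 2, 3, 3]]
p = sgld_matrix(roi, dr=0, dc=1, levels=4)   # horizontal neighbors, distance 1
features = texture_features(p)
```

Repeating the calculation for several displacements gives features at the various pixel pair distances mentioned above.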

For extraction of morphological features, a segmentation method is used similar to that in the
detection program except that segmentation is applied to an unfiltered image in order to avoid distortion
of its shape due to signal-to-noise ratio (SNR) enhancement. An ROI containing a microcalcification is
background-corrected and the signal is extracted based on the local SNR using a region growing
technique. We calculate visibility descriptors such as the SNR, mean density, and size of the
microcalcifications, shape descriptors such as the second moments, the ratio of the second moments, the
eccentricity and the ratio of major and minor axes of an effective ellipse, and determine cluster features
such as the standard deviation, the maximum, and the coefficient of variation of the visibility
descriptors, shape descriptors, and the number of microcalcifications within the cluster.

We have trained a linear discriminant analysis (LDA) classifier to classify the input features. The
performance of the trained classifier has been tested both with a jackknife method and a cross-validation
method. Both methods yielded similar test results. The discriminant scores of the LDA were analyzed
with Receiver Operating Characteristic (ROC) methodology and the area under the ROC curve (Az) was
used as a performance index. In the texture feature space, the LDA classifier achieved an Az of 0.88 for
training and 0.84 for testing. In the morphological feature space, the LDA classifier achieved an Az of
0.84 for training and 0.79 for testing. In the combined texture and morphological features, the Az's were
improved to 0.94 and 0.89, respectively, for training and testing.

We have also trained a non-linear classifier, a back-propagation neural network (BPN), to classify the
malignant and benign microcalcifications. In the texture feature space, the BPN classifier achieved an Az
of 0.88 for training and 0.86 for testing. In the morphological feature space, the BPN classifier achieved
an Az of 0.84 for training and 0.80 for testing. In the combined texture and morphological features, the
Az's were improved to 0.94 and 0.91, respectively, for training and testing. These results demonstrate
the feasibility of our approach to classification of malignant and benign microcalcifications.
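For readers unfamiliar with the Az index, the sketch below computes an area under the ROC curve from two lists of classifier scores. It uses the nonparametric (Wilcoxon/Mann-Whitney) estimate as a stand-in; how Az was estimated in this work is not stated here, and fitted ROC curves can give somewhat different values. The scores in the example are made up for illustration.

```python
def auc(malignant_scores, benign_scores):
    """Probability that a malignant case scores higher than a benign one."""
    wins = 0.0
    for m in malignant_scores:
        for b in benign_scores:
            if m > b:
                wins += 1.0
            elif m == b:
                wins += 0.5                  # ties count as half
    return wins / (len(malignant_scores) * len(benign_scores))

# Hypothetical discriminant scores, for illustration only.
example = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
print(round(example, 3))
```

A value of 0.5 corresponds to chance performance and 1.0 to perfect separation of the two classes.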


5. Recognition of Mammographic Masses


5.1.  Detection  of  Mammographic  Masses 

(A)  Computerized  detection  of  masses  on  mammograms 

We  have  developed  a  new  approach  for  segmentation  of  suspicious  mass  regions  on  digitized 
mammograms using an adaptive Density-Weighted Contrast Enhancement (DWCE) filter in conjunction
with  Laplacian-Gaussian  (LG)  edge  detection.  The  DWCE  filter  can  enhance  masses  of  a  wide  range  of 
intensities  and  sizes,  and  suppress  background  intensity  variations.  The  algorithm  processes  a 
mammogram  in  two  stages.  In  the  first  stage  the  entire  mammogram  is  filtered  globally  using  a  DWCE 
adaptive  filter  which  enhances  the  local  contrast  of  the  image  based  on  its  local  mean  pixel  values.  The 
enhanced  image  is  then  segmented  with  an  LG  edge  detector  into  isolated  objects.  A  feature  classifier 
using  morphological  or  texture  features  is  used  to  reduce  the  number  of  FPs.  In  the  second  stage  of 
processing,  the  DWCE  adaptive  filter  and  the  edge  detector  are  applied  locally  to  each  of  the  segmented 
object  regions  detected  in  the  first  stage.  The  local  operation  allows  more  precise  extraction  of  the 
features  of  the  objects.  The  number  of  objects  is  further  reduced  based  on  these  features.  ROIs  are 
extracted  from  the  image  based  on  the  remaining  object  set.  The  selected  ROIs  are  input  to  either  an  LDA 
classifier  or  a  convolution  neural  network  to  further  differentiate  TPs  and  FPs  as  described  below. 

Using  a  cross-validation  test  method  with  two  partitions,  our  results  indicated  that  the  current  algorithm 
achieved  an  average  test  TP  rate  of  80%  at  about  2.1  FPs/image  and  a  TP  rate  of  90%  at  4.7  FPs/image. 
This accuracy may not be adequate in clinical practice; however, it demonstrates the feasibility of
detecting  masses  on  mammograms  with  the  new  DWCE  technique.  We  therefore  propose  to  evaluate  the 
performance  of  the  algorithm  in  a  preclinical  trial  using  a  large  number  of  randomly  selected  clinical 
cases.  The  causes  of  FP  detections  in  such  a  test  will  be  analyzed,  and  more  effective  FP  reduction 
methods  will  be  developed  in  order  to  improve  the  detection  accuracy. 

(B)  Multiresolution  wavelet  decomposition  and  texture  analysis 

We  have  developed  a  new  method  to  distinguish  abnormal  from  normal  tissue  for  CAD 
algorithms  using  texture  analysis.  An  ROI  containing  mass  or  normal  breast  tissue  is  input  to  the 
program.  The  wavelet  transform  is  used  to  decompose  the  ROI  into  several  scales.  Global 
multiresolution  texture  features  are  calculated  from  the  SGLD  matrices  of  the  low-pass  wavelet 
coefficients  up  to  a  certain  scale  and  then  at  variable  distances  between  the  pixel  pairs.  Texture  features 
in  the  suspicious  object  sub-region  and  their  differences  with  features  in  the  peripheral  sub-regions  of  the 
ROI  are  also  calculated  to  form  a  local  texture  feature  space.  Stepwise  linear  discriminant  analysis  is 
used  to  select  effective  features  from  the  combined  global-local  feature  space  to  maximize  the  separation 
of  mass  and  normal  tissue  ROIs.  To  evaluate  the  accuracy  of  this  method,  we  used  168  ROIs  containing 
a  biopsy-proven  mass  and  508  ROIs  with  normal  dense,  mixed  dense/fatty,  or  fatty  tissues  extracted 
from digitized mammograms by radiologists. The ROIs were randomly and equally divided into a
training and a test group. It was found that, using the global multiresolution feature space alone, the Az
was  0.89  and  0.87  for  the  training  and  test  groups,  respectively.  Using  local  features  only,  the  Az  was 
0.88  and  0.85  for  the  training  and  test  groups,  respectively.  With  the  combined  global  and  local  feature 
spaces,  the  Az  reached  0.95  and  0.91  for  the  training  and  test  groups,  respectively.  When  this 
classification  method  was  applied  to  the  false-positives  detected  by  the  automated  mass  detection 
program  using  the  DWCE  approach  described  above,  the  classification  accuracy  in  terms  of  Az  reached 
0.97  during  training  and  0.96  during  testing  in  the  combined  global  and  local  feature  space.  The  results 
demonstrate  that  an  LDA  using  a  combination  of  the  global  and  the  local  texture  features  can  effectively 
classify  masses  from  normal  tissue  on  mammograms.  This  classifier  will  be  incorporated  into  the 
automated  mass  detection  program  as  one  of  the  steps  to  reduce  FP  detections  in  the  preclinical  trial. 

(C) Artificial neural network

We  have  investigated  the  use  of  a  convolution  neural  network  (CNN)  and  a  backpropagation 
neural  network  (BPN)  for  classification  of  ROIs  on  mammograms  as  either  masses  or  normal  tissue.  A 
CNN  is  a  BPN  with  two-dimensional  weight  kernels  that  operate  on  images.  A  generalized,  fast  and 
stable  implementation  of  the  CNN  has  been  developed.  ROIs  containing  masses  and  normal  breast 
tissue  are  first  segmented  with  an  automated  detection  program.  The  CNN  input  images  are  obtained 
from the ROIs using (i) averaging and subsampling, and (ii) texture feature extraction from SGLD
matrices and gray level difference statistics (GLDS) vectors on smaller sub-regions inside the ROI. In
(ii), features computed over different sub-regions were arranged as texture-images, and subsequently
used as inputs to the CNN. Input features to the BPN are obtained from SGLD matrices at multiple
resolutions. Using 168 ROIs containing masses and 504 ROIs containing normal tissue, we found that
the test Az reached 0.83 for the CNN using spatial input images, 0.87 using spatial and texture images,
0.88 for the BPN using SGLD texture features, and 0.91 for a combination of the CNN and BPN
outputs.  Our  results  indicate  that  the  CNN  performance  may  be  improved  by  using  additional  texture 
information  and  that  the  overall  performance  may  be  improved  by  combining  CNN  and  BPN  classifiers. 

(D) Feature selection

The  performance  of  a  feature  classifier  in  a  CAD  scheme  depends  strongly  on  feature  selection. 
For  the  LDA,  we  use  a  stepwise  LDA  procedure  to  select  significant  feature  variables  for  the 
classification  tasks.  In  order  to  have  a  general  feature  selection  method  that  can  be  applied  to  both  linear 
and  non-linear  classifiers,  we  have  investigated  the  application  of  a  genetic  algorithm  (GA)  for  feature 
selection.  One  of  our  applications  is  to  select  features  for  the  classification  of  masses  and  normal  breast 
tissue.  ROIs  containing  biopsy-proven  masses  and  normal  ROIs  containing  breast  parenchyma  are  first 
segmented  from  mammograms.  A  total  of  587  texture  and  morphological  features  are  automatically 
extracted  from  each  ROI.  Multiple  regression  is  applied  to  the  features  selected  by  the  GA  to  form  a 
discriminant function with the training set. The presence/absence of a feature in the regression is coded
by a 1 or 0 at the appropriate gene in a chromosome in the GA. The fitness and survival rate of a
chromosome  are  determined  by  Az.  The  chromosomes  are  allowed  to  crossover,  mutate,  and  evolve  for 
a  number  of  generations  in  a  training  procedure.  The  final  selected  features  are  used  for  classification  of 
the  test  set.  To  evaluate  the  effectiveness  of  this  GA,  we  used  168  ROIs  with  masses  and  504  ROIs  with 
normal  tissue  as  our  data  set.  We  randomly  divided  the  data  set  into  10  partitions  of  training  and  test 
subsets.  The  GA  selected  an  average  of  20  features  from  the  587  input  features  for  each  training 
process.  It  was  found  that  the  average  training  and  test  Az  values  reached  0.93  and  0.89,  respectively. 
This  accuracy  is  superior  to  that  obtained  with  the  entire  feature  set  input  to  the  classifier  without  feature 
selection,  or  that  with  features  selected  individually  based  on  their  distributions.  We  also  compared  the 
results  to  feature  selection  using  the  stepwise  LDA  method.  Using  the  same  cross-validation  test 
technique,  the  test  Az's  obtained  with  both  methods  were  similar,  indicating  that  the  GA  and  stepwise 
LDA  approaches  can  provide  near-optimal  feature  selection  for  linear  classifiers. 
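The chromosome encoding and evolution described above can be sketched as a small genetic algorithm over bit strings. This is a generic illustration, not the GA used in this work: the selection rule, single-point crossover, mutation rate, and population size are assumptions, and the fitness is passed in as a function. In this work the fitness was based on the Az of the regression discriminant on the training set; the toy fitness below merely rewards a few hypothetical "informative" feature indices.

```python
import random

def run_ga(n_features, fitness, pop_size=20, generations=30,
           p_mutation=0.02, seed=0):
    """Evolve bit-string chromosomes; bit k = 1 means feature k is selected."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:pop_size // 2]         # fitter half survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)   # single-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ 1 if rng.random() < p_mutation else bit
                     for bit in child]           # bit-flip mutation
            children.append(child)
        population = parents + children
    return max(population, key=fitness)

# Toy fitness (hypothetical): reward features 1, 4, 7 and penalize all others.
INFORMATIVE = {1, 4, 7}

def toy_fitness(chromosome):
    hits = sum(bit for k, bit in enumerate(chromosome) if k in INFORMATIVE)
    extras = sum(bit for k, bit in enumerate(chromosome) if k not in INFORMATIVE)
    return hits - 0.1 * extras

best = run_ga(10, toy_fitness)
selected = [k for k, bit in enumerate(best) if bit]
```

Because the fitter half of each generation is carried over unchanged, the best fitness in the population never decreases from one generation to the next.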

5.2  Classification  Of  Malignant  And  Benign  Mammographic  Masses 

We  have  investigated  the  classification  of  malignant  and  benign  masses  on  mammograms.  After 
ROIs  containing  suspicious  masses  are  located  by  the  automated  mass  detection  program  on  the 
mammogram,  segmentation  and  feature  extraction  are  performed  locally  in  each  ROI.  A  new 
segmentation  method  has  been  developed  by  the  research  team  based  on  a  migrating  mean  clustering 
algorithm. An ROI is first corrected for the low-frequency structured background. This method then
separates the mass from the surrounding background based on clustering of pixels with similar gray level
and edge gradient information. The two groups of pixels are coded as a binary image so that a simple
edge tracking algorithm can define the boundary. We extract morphological features such as the
fuzziness or spiculation of the mass margin, which is quantified by the root-mean-square (RMS) variation
around a smoothed version of the edge, the perimeter-to-area ratio, and shape features such as
circularity, rectangularity, ratio of its axes, and shape features derived from the normalized radial length.
We also extract texture features in a 40-pixel-wide boundary region surrounding the mass from the
SGLD  matrices.  The  features  are  then  input  to  an  LDA  or  a  BPN  classifier  to  distinguish  the  malignant 
and  benign  masses.  The  results  indicated  that  the  migrating  mean  clustering  method  could  extract  mass 
margins  more  closely  than  other  edge  detection  techniques  that  we  tested.  With  the  morphological 
features  and  texture  features  derived  from  the  boundary  regions  surrounding  the  mass,  we  obtained  a 
training  Az  of  0.86  and  a  test  Az  of  0.82  for  a  group  of  85  malignant  and  83  benign  masses.  The  Az 
was  0.86  by  a  radiologist's  visual  evaluation  in  the  same  set  of  mammograms.  This  result  is 
encouraging  although  improved  methods  still  need  to  be  developed  to  further  increase  the  classification 
accuracy  before  clinical  implementation. 
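Some of the shape features named above can be illustrated on a polygonal boundary. The sketch below uses common definitions (circularity as 4πA/P², radial lengths measured from the vertex centroid and normalized by the maximum); the definitions actually used in this work are not given here, so these formulas and the function name are assumptions.

```python
from math import hypot, pi, sqrt

def shape_features(boundary):
    """boundary: ordered list of (x, y) vertices of a closed contour."""
    n = len(boundary)
    closed = boundary + boundary[:1]
    edges = list(zip(closed, closed[1:]))
    perimeter = sum(hypot(x2 - x1, y2 - y1) for (x1, y1), (x2, y2) in edges)
    # Shoelace formula for the enclosed area.
    area = abs(sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in edges)) / 2.0
    cx = sum(x for x, _ in boundary) / n
    cy = sum(y for _, y in boundary) / n
    radii = [hypot(x - cx, y - cy) for x, y in boundary]
    r_max = max(radii)
    nrl = [r / r_max for r in radii]             # normalized radial length
    nrl_mean = sum(nrl) / n
    nrl_std = sqrt(sum((v - nrl_mean) ** 2 for v in nrl) / n)
    return {
        "perimeter_to_area": perimeter / area,
        "circularity": 4.0 * pi * area / perimeter ** 2,
        "nrl_mean": nrl_mean,
        "nrl_std": nrl_std,
    }

square = shape_features([(0, 0), (2, 0), (2, 2), (0, 2)])
```

Under these definitions a circle has a circularity of 1 and a radial-length standard deviation of 0, while an irregular or spiculated outline gives a lower circularity and a larger standard deviation.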


6. Status Report on the Implementation of CADx for the Detection of Clustered
Microcalcifications

We  continue  to  work  on  the  CADx  program  with  a  DEC  Alpha  workstation.  The  basic  user 
interface  is  complete.  The  user  interface  can  select  a  mammogram  and  display  it  on  the  workstation. 
Several image functions have been implemented: (1) "window and level" for the adjustment of the
brightness and contrast, (2) pan, (3) a cursor box for the user to select the area of interest, and (4) printing
of the image with CADx marks on high-quality paper or laser film. The initial clinical trial began on
January 15, 1996 at the Breast Imaging Division of Georgetown University Hospital. The results of this
study will be presented at the 1997 SPIE Medical Imaging Conference at Newport Beach, California.


7.  Contractual  (SOW)  Issues 

Dr. R.V. Shah, chief breast radiologist at Brooke Army Medical Center, and Dr. Don Smith,
attending breast radiologist at Madigan Army Medical Center, sent us some proven cases (in the
Spring of 1995) associated with mammographic microcalcifications for inclusion in our test database
[Private Communication]. We are in the process of installing our software for evaluation at Army
Hospitals. The CADx clinical trial can be started as soon as they are ready for the experiment. At
present, radiologists would like to have an integrated viewing system so that they can evaluate the effects
of CADx in a clinical setting. We are currently negotiating with R2 Technology Inc., which has a product
that synchronizes soft copy images on small monitors mounted under the bench of the mammography
viewing alternator using a bar code system. We plan to miniaturize the mammograms and provide the
marks indicated by our CADx. The images will be interfaced to the R2 monitors to facilitate the clinical
use of this development. Dr. R.V. Shah can be reached at (210)916-4062. R2 Technology's phone number
is (415)254-8988.


8.  Conclusions 

During the past three years, we have spent our effort not only on algorithm improvement but also
on merging our newly developed algorithms, written in C, with useful code previously developed by
Dr. Chan and her colleagues.

At this point, we have completed our mammographic image compression and CADx research in
terms of algorithm improvement and computer speed. Database collection is underway and will continue
in the clinical tests to be conducted. Several basic functions and a user interface have been implemented
on the workstation. The CADx clinical trial has been undertaken at Georgetown University Medical
Center. We will report the results of the clinical test in a future paper.




References 


Antonini  M,  Barlaud  M,  Mathieu  P,  Daubechies  I:  “Image  Coding  Using  Wavelet  Transform,”  IEEE 
Trans.  Image  Proc.,  vol.  1  No.  2,  1992,  pp.  205  -  220. 

Black  JW,  Young  B:  “A  Radiological  and  Pathological  Study  of  the  Incidence  of  Calcifications  in 
Diseases  of  the  Breast  and  Neoplasms  of  Other  Tissues,”  Br  J  Radiol  1965;38;596. 

Chan HP, Doi K, Galhotra S, Vyborny CJ, MacMahon H, Jokich PM: “Image Feature Analysis and
Computer-aided Diagnosis in Digital Radiography. 1. Automated Detection of Microcalcifications
in Mammography,” Med. Phys., 1987;14:538.

Chan HP, Doi K, Vyborny CJ, et al.: "Improvement in Radiologists' Detection of Clustered
Microcalcifications on Mammograms: The Potential of Computer-Aided Diagnosis," Invest.
Radiol., vol. 25, 1990, pp. 1102-1110.

Cody  MA,  “The  Fast  Wavelet  Transform,”  Dr.  Dobb’s  Journal,  April  1992,  pp.  16-28. 

Daubechies I, “Orthonormal Bases of Compactly Supported Wavelets,” Comm. on Pure and Appl.
Math., Vol. XLI, 1988, pp. 909-996.

Fisher  ER,  Gregorio  RM,  Fisher  B,  Redmond  C,  Vellios  F,  Sommers  SC:  “The  Pathology  of  Invasive 
Breast Cancer,” Cancer 1975;36:1.

MacMahon H, Doi K, Sanada S, Montner SM, Giger ML, Metz CE, Nakamori N, Yin F, Xu X,
Yonekawa H, and Takeuchi H: "Data Compression: Effect of Data Compression on Diagnostic
Accuracy in Digital Chest Radiography," Radiology, Vol. 178, No. 1, Jan. 1991, pp. 175-179.

Mallat  S,  “A  Theory  For  Multiresolution  Signal  Decomposition:  The  Wavelet  Representation”,  IEEE 
Trans.  Pat.  Anal.  Mach.  Intel.,  Vol.  11  No.  7,  1989,  pp.  674-693. 

Swets  JA  and  Pickett  RM,  Evaluation  of  Diagnostic  Systems.  Academic  Press,  New  York,  1982. 

Witten IH, Neal RM, and Cleary JG: "Arithmetic Coding for Data Compression," Comm. of the ACM,
Vol.  30,  June  1987,  pp.  520-540. 

Ziv J and Lempel A: "A Universal Algorithm for Sequential Data Compression," IEEE Trans. on Info.
Theory,  Vol.  IT-23,  No.  3,  May  1977,  pp.  337-343. 


Presentations  and  Publications  During  the  2nd  Year  of  the  Project 

1. Petrosian A, Chan HP, Helvie MA, Goodsitt MM, Adler DD: “Computer-aided diagnosis in
mammography:  classification  of  masses  and  normal  tissue  by  texture  analysis,”  Physics  in  Medicine 
and  Biology  1994;  39:  2273-2288. 

2.  Cheng  SNC,  Chan  HP,  Helvie  MA,  Goodsitt  MM,  Adler  DD,  St.  Clair  D:  “Classification  of  mass 
and  non-mass  regions  on  mammograms  using  artificial  neural  network,”  J.  of  IS&T  1994;  38:  598- 
603. 




3.  Lo  SC,  Kim  MB,  Li  H,  Krasner  BH,  Freedman  MT,  and  Mun  SK,  “Radiological  Image 
Compression:  Image  Characteristics  and  Clinical  Consideration,”  SPIE  Proceedings,  Medical 
Imaging  1994,  vol.  2164,  pp.  276-281. 

4.  Wu  YC,  Lo  SC,  Freedman  MT,  Zuurbier  RA,  Hasegawa  A,  Mun  SK:  “Classification  Of 
Microcalcifications  In  Radiographs  Of  Pathological  Specimen  For  The  Diagnosis  Of  Breast  Cancer,” 
Academic  Radiology,  1995,  Vol.  2,  pp.199-204. 

5.  Lo  SC,  Chien  M,  Jong  S,  Li  H,  Freedman  MT,  and  Mun  SK:  “Extraction  of  Rounded  and  Line 
Objects for the Improvement of Medical Image Pattern Recognition,” IEEE/MIC Proceedings, Nov.
1994. 

6. Lo SC, Lin JS, Freedman MT, and Mun SK: “Application of Artificial Neural Network to Medical
Image Pattern Recognition,” WCNN, INNS Press, 1994, Vol. I, pp. 37-42.

7. Chan HP, Wei D, Helvie MA, Sahiner B, Adler DD, Goodsitt MM, Petrick N. "Computer-aided
classification of mammographic masses: Linear discriminant analysis in texture feature space."
Physics in Medicine and Biology. 1995; 40: 857-876.

8. Wei D, Chan HP, Helvie MA, Sahiner B, Petrick N, Adler DD, Goodsitt MM. "Classification of
mass and normal breast tissue on digital mammograms: Multiresolution texture analysis." Medical
Physics. 1995; 22: 1501-1513.

9. Chan HP, Lo SCB, Sahiner B, Lam KL, Helvie MA. "Computer-aided detection of
mammographic microcalcifications: Pattern recognition with an artificial neural network." Medical
Physics. 1995; 22: 1555-1567.

10.  Lo  SCB,  Chan  HP,  Lin  JS,  Li  H,  Freedman  M,  Mun  SK.  "Artificial  convolution  neural  network 
for  medical  image  pattern  recognition."  Neural  Networks.  1995,  Vol  8,  No.  7/8,  pp.1201-1214. 

11. Lo SC, Li H, Krasner BH, and Mun SK, “Full-frame compression algorithms of wavelet and cosine
transform,” SPIE Proc. Med. Imaging 1995, Vol. 2431, pp. 195-202.

12.  Lo  SC,  Li  H.,  Freedman  MT,  and  Mun  SK,  “Artificial  visual  neural  network  with  wavelet  kernels 
for  general  disease  pattern  recognition,”  SPIE  Proceedings,  Medical  Imaging  1995,  vol.  2434,  pp. 
579-588. 

13. Chan HP, Wei D, Lam KL, Lo SCB, Helvie MA, Adler DD. "Computerized detection and
classification of microcalcifications on mammograms." SPIE Proc. Med. Imaging 1995, Vol. 2434,
pp. 612-620.

14. Sahiner B, Chan HP, Wei D, Helvie MA, Petrick N, Adler DD, Goodsitt MM: “Image
classification using a convolution neural network,” SPIE Proc. Med. Imaging 1995, Vol. 2434, pp.
838-845.

15.  Petrick  N,  Chan  HP,  Sahiner  B,  Wei  D,  Helvie  MA,  Goodsitt  MM,  Adler  DD:  “Automated 
detection of breast masses on digital mammograms using adaptive density-weighted
contrast-enhancement filtering,” SPIE Proc. Med. Imaging 1995, Vol. 2434, pp. 590-597.

16.  Wei  D,  Chan  HP,  Helvie  MA,  Sahiner  B,  Petrick  N,  Adler  DD,  Goodsitt  MM:  “Multiresolution 
texture  analysis  for  classification  of  mass  and  normal  breast  tissue  on  digital  mammograms,”  SPIE 
Proc.  Med.  Imaging  1995,  Vol.  2434,  pp.  606-611. 




17.  Petrick  N,  Chan  HP,  Sahiner  B,  Wei  D.  An  adaptive  density  weighted  contrast  enhancement  filter 
for  mammographic  breast  mass  detection.  IEEE  Trans.  Medical  Imaging.  1996,  Vol.  15,  No.  1, 
pp.  59-67. 

18.  Lo  SC,  Lin  JS,  Li  H,  Hasegawa  A,  Freedman  MT,  and  Mun  SK,  “Detection  of  subtle  clustered 
microcalcifications  using  fuzzy  modeling  and  convolution  neural  network,”  SPIE  Proceedings, 
Medical  Imaging  on  Image  Processing,  1996,  Vol.  2710,  pp.  8-15. 

19.  Lo  SC,  Li  H,  Wang  Y,  Freedman  MT,  and  Mun  SK,  “On  optimization  of  orthonormal  wavelet 
decomposition:  Data  accuracy,  feature  preservation,  and  compression,”  SPIE  Proceedings,  Medical 
Imaging  on  Image  Display,  1996,  Vol.  2707,  pp.  201-214. 

20.  Osamu  Tsujii,  Akira  Hasegawa,  Chris  Y.  Wu,  Shih-Chung  B.  Lo,  Matthew  T.  Freedman,  Seong 
K.  Mun,  "Classification  of  microcalcifications  in  digital  mammograms  for  the  diagnosis  of  breast 
cancer"  in  PROC.  SPIE  Proceedings,  Medical  Imaging  on  Image  Processing,  vol.  2710,  (#  83) 
[Received  Cum  Laude  Award  in  the  Meeting] 


Articles  Accepted  for  Publication: 

1. Sahiner B, Chan HP, Petrick N, Wei D, Helvie MA, Adler DD, Goodsitt MM. "Classification of
mass  and  normal  breast  tissue:  A  convolution  neural  network  classifier  with  spatial  domain  and 
texture  images,"  IEEE  Trans.  Medical  Imaging. 

2. Chan HP, Lo SCB, Niklason LT, Ikeda DM, Lam KL. "Image compression in digital
mammography:  Effects  on  computerized  detection  of  subtle  microcalcifications."  Medical  Physics. 




Articles  Submitted  for  Publication: 


1. Li H, Liu KJ, and Lo SC, "Fractal modeling and segmentation for the enhancement of
microcalcifications  in  digital  mammograms,"  IEEE  Trans.  Med.  Imag. 

2.  Lo  SC,  Li  H,  Wang  Y,  Freedman  MT,  and  Mun  SK,  "On  optimization  of  wavelet  decomposition 
for image compression and feature preservation," IEEE Trans. on Image Processing.

3. Sahiner B, Chan HP, Petrick N, Wei D, Helvie MA, Adler DD, Goodsitt MM, "Image feature
selection by a genetic algorithm: Application to classification of mass and normal breast tissue on
mammograms," Medical Physics.

4.  Petrick  N,  Chan  HP,  Wei  D,  Sahiner  B,  Helvie  MA,  Adler  DD,  "Automated  detection  of  breast 
masses on mammograms using adaptive contrast enhancement and tissue classification," Medical
Physics 

5. Wei D, Chan HP, Petrick N, Sahiner B, Helvie MA, Adler DD, Goodsitt MM, "False-positive
reduction  technique  for  detection  of  masses  on  digital  mammograms:  Global  and  local 
multiresolution  texture  analysis,"  Medical  Physics. 


Personnel  Receiving  Pay  From  This  Grant 

Shih-Chung  B.  Lo,  Ph.D. 

Matthew  T.  Freedman,  M.D. 

Akira  Hasegawa,  Ph.D. 

Yuzheng  C.  Wu,  Ph.D. 

Huai  Li,  M.S. 

Heang-Ping  Chan,  Ph.D. 

Mark  Helvie,  M.D. 

Nicholas Petrick, Ph.D.

Datong  Wei,  Ph.D. 

Berkman  Sahiner,  Ph.D. 




Pergamon 


Neural Networks, Vol. 8, No. 7/8, pp. 1201-1214, 1995
Copyright  ©  1995  Elsevier  Science  Ltd 
Printed  in  Great  Britain.  All  rights  reserved 
0893-6080/95  $9.50 +  .00 

0893-6080(95)00061-5 


1995  SPECIAL  ISSUE 

Artificial Convolution Neural Network for Medical Image Pattern Recognition


Shih-Chung B. Lo,* Heang-Ping Chan,† Jyh-Shyan Lin,* Huai Li,*
Matthew T. Freedman* and Seong K. Mun*

* Georgetown University Medical Center and † University of Michigan Medical Center
(Received  1  November  1994;  revised  and  accepted  4  May  1995) 

Abstract — We  have  developed  several  training  methods  in  conjunction  with  a  convolution  neural  network  for  general 
medical  image  pattern  recognition.  An  unconventional  method  of  using  rotation  and  shift  invariance  is  also  proposed 
to  enhance  the  neural  net  performance.  The  structure  of  the  artificial  neural  network  is  a  simplified  network  structure 
of  the  neocognitron.  Two-dimensional  local  connection  as  a  group  is  the  fundamental  architecture  for  the  signal 
propagation  in  the  convolution  neural  network.  Weighting  coefficients  of  convolution  kernels  are  formed  by  the 
neural network through backpropagated training for this artificial neural net. In addition, radiologists' reading
procedure  was  modelled  in  order  to  instruct  the  artificial  neural  network  to  recognize  the  predefined  image  patterns 
and  those  of  interest  to  experts.  Our  training  techniques  involve  (a)  radiologists’  rating  for  each  suspected  image 
area,  (b)  backpropagation  of  generalized  distribution,  (c)  trainer  imposed  functions,  (d)  shift  and  rotation 
invariance  of  diagnosis  interpretation,  and  (e)  consistency  of  clinical  input  data  using  appropriate  background 
reduction  functions. 

We  have  tested  these  methods  for  detecting  lung  nodules  on  chest  radiographs  and  microcalcifications  on 
mammograms.  The  performance  studies  have  shown  the  potential  use  of  this  technique  in  a  clinical  environment.  We 
also  used  a  profile  double-matching  technique  for  initial  nodule  search  and  used  a  wavelet  high-pass  filtering 
technique  to  enhance  subtle  clustered  microcalcifications.  We  set  searching  parameters  at  a  highly  sensitive  level  to 
identify  all  potential  disease  areas.  The  artificial  convolution  neural  network  acts  as  a  final  detection  classifier  to 
determine  whether  a  disease  pattern  is  shown  on  the  suspected  image  area. 

Keywords: Neural network, Computer-assisted diagnosis, Classification invariance of operations, Output
association fuzzy function, Trainer imposed function.

1. INTRODUCTION

As high speed computers become cost-effective tools, many scientists have started to
investigate potential technologies for computer-assisted diagnosis (Doi, 1989; Doi et al.,
1992). More and more digital imaging systems are available to radiology departments as
well. It is known that conventional diagnostic procedures can be enhanced by various
methods through computers. The applications in computer-assisted diagnosis will be much
more meaningful when clinical images are fully computerized and networks are available in
radiology departments. Medical diagnoses involve very sophisticated decision-making
processes. Integration of the patient information in a Picture Archiving and Communication
System (PACS) (Horii et al., 1990; Huang et al., 1990) and development of computer-aided
diagnosis will provide radiologists with more relevant information to significantly improve
patient care.

Acknowledgements: This project was supported in part by U.S. Army Grant DAMD17-93-J-3007
and American Cancer Society Grant No. EDT-93. The views, opinions and/or findings
contained in this paper are those of the authors and should not be construed as an official
Department of the Army position, policy or decision unless so designated by other
documentation. The LABROC4 program was provided by Dr. Charles Metz of the University of
Chicago. The authors are grateful for the reviewers' constructive suggestions and to
Ms Susan Kirby and Dr. Walid Tohme for their editorial assistance.

Requests for reprints should be sent to Dr Shih-Chung B. Lo, Radiology Department,
Georgetown University Medical Center, 2115 Wisconsin Avenue, N.W., Suite 603, Washington,
DC 20007, USA.

Skilled radiologists have a high degree of accuracy in diagnosis. However, there remain
problems in the detection of some diseases, problems that cannot be corrected with current
methods of training and high levels of clinical skill and experience. These problems
contribute, for example, to the miss rate in the detection of small pulmonary nodules, of
minimal interstitial lung disease, and of changes in pre-existing interstitial lung disease.
In this paper, we employed a convolution neural network and proposed several training
methods to enhance the detection of small pulmonary nodules and microcalcifications on
digital projection X-ray images. Both diseases are clinically important in diagnostic
imaging and are relatively difficult to identify when they are superimposed on other
anatomical structures.

Several image processing techniques have been proposed for the detection of lung nodules:
(a) thresholding and circularity calculation (Giger et al., 1988), (b) morphological
operation (Giger et al., 1990), and (c) 2-D sphere profile matching technique (Lo et al.,
1993). With each of these methods there is a trade-off between increased sensitivity and
decreased specificity. By setting less stringent criteria with the above algorithms, the
sensitivity of the detection programs would be relatively high but false detections would
also increase. On the other hand, a low sensitivity setting of the program would
potentially miss many true positives. To use these methods for the detection of small lung
nodules, additional techniques are needed to reduce the number of false positives while
maintaining high true positive detection. A similar situation was found in the detection of
microcalcifications in mammography (Chan, Doi & Galhotra, 1987; Chan, Doi & Vyborny, 1988,
1990). For this reason, several investigators have used the neural network as a classifier
to improve the detection rate (Lo et al., 1993, 1996; Wu et al., 1992).

The nets of the artificial neural network used in conventional backpropagation are fully
and uniformly connected from one node of the upper layer to each node in the next lower
layer. When applying this type of neural network to directly perceive image patterns, the
performance seems rather limited (Lo et al., 1993). In some applications the features
generated by image processing techniques were used for image pattern recognition. In
observing clinical radiologists' work, it is clear that they use findings in the region
surrounding the suspected area when identifying the presence of a true disease. We
therefore believe that the neighborhood information rather than non-local information in
the image must be taken into more serious consideration during the neural network training.
For direct image input, we learned that the neocognitron has been successfully used in the
recognition of handwritten characters and numbers (Fukushima, 1980, 1989; Fukushima et al.,
1983; Fukushima & Wake, 1991). The neocognitron also seems likely to be able to incorporate
information of the area surrounding the suspected area into its processes and has the
potential to deal with ambiguity in the information set. This is the motivation to
incorporate an artificial visual neural network in our research for medical image pattern
recognition. In this paper we propose a convolution neural network structure and several
associated algorithms for general medical image applications in which an abnormality of a
disease pattern can be shown in a small image area. The reduction of the image area for
each training or testing is recommended for two reasons: (a) a large area demands a great
deal of computation and (b) it potentially defocuses the features with which the user
intends to train the neural network.

2.  MATERIAL  AND  METHODS 
2.1.  Fundamental  Approach 

Radiographs  for  diagnostic  medical  imaging  have 
been  used  for  many  years.  The  diagnostic  results  are 
based  on  the  visual  pattern  recognition  by  trained 
radiologists.  Throughout  this  study  we  tried  to  mimic 
the  radiologists’  diagnostic  viewing  routine.  Typically 
radiologists  scan  the  image,  looking  for  potential 
abnormalities,  then  evaluate  each  suspected  area.  In 
detecting  lung  nodules,  radiologists  first  search  for 
suspected  areas  on  the  chest  radiograph,  looking  for 
bright  round  objects  within  the  rib  cage  boundary. 
Next,  each  suspected  area  is  examined  to  compare  the 
contrast  information  of  the  bright  spot  to  the  local 
background.  Sometimes  a  radiologist  uses  several 
viewing  positions  to  look  at  the  area.  When  using  a 
workstation,  the  radiologist  may  utilize  zoom  and 
“window  and  level”  functions  to  get  different  views 
about  roundness  and  contrast  information  for  the 
suspected  areas.  The  “window”  function  takes  a 
given  digital  value  range  (e.g.,  3000)  and  rescales  onto 
a  monitor  gray  value  range  (typically  256).  The 
“level”  function  selects  the  middle  digital  value  for 
the  “window”.  Since  both  window  range  and  level 
can  be  simultaneously  operated,  the  radiologist  is 
able  to  observe  various  contrasts.  The  differentiation 
between  a  nodule  and  an  end-on  vessel  can  be  very 
difficult  for  the  human  eye  to  discern  but  is  often 
based  on  the  presence  of  projections  from  the  round 
shape  and  its  relative  contrast  compared  to  the 
background  and  to  other  vessels.  For  the  detection 
of microcalcifications, radiologists use similar viewing
steps. The main differences between the detection
of  lung  nodules  and  microcalcifications  are  the 
disease  patterns,  clinical  indications,  and  experience. 
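
The "window" and "level" functions described above can be sketched as a simple linear rescaling. This is a minimal illustration, not the workstation's actual code: the function name `window_level`, the constant `GRAY_LEVELS`, and the choice to saturate values that fall outside the window are assumptions made for the example.

```python
import numpy as np

GRAY_LEVELS = 256  # typical monitor gray value range mentioned in the text


def window_level(image, window, level):
    """Rescale digital values onto the monitor gray range.

    window: width of the digital value range to display (e.g., 3000)
    level:  middle digital value of that range
    Values outside the window are clipped (an assumption of this sketch).
    """
    image = np.asarray(image, dtype=np.float64)
    low = level - window / 2.0               # lowest digital value displayed
    scaled = (image - low) / window          # 0..1 inside the window
    scaled = np.clip(scaled, 0.0, 1.0)       # saturate outside the window
    return np.round(scaled * (GRAY_LEVELS - 1)).astype(np.uint8)
```

For example, `window_level(pixels, window=3000, level=2000)` displays digital values from 500 to 3500; narrowing the window raises the displayed contrast, and moving the level shifts which digital values fall in the visible range.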

The radiologists' diagnostic viewing steps described above were modelled and converted to
computer algorithms. The detailed algorithms and techniques for the pre-scan were
previously described by Lo et al. (1993) for the detection of lung nodules on chest
radiographs and by Chan and coworkers (1987, 1988, 1990, 1995) for the detection of
microcalcifications on mammograms. In this paper, we concentrate mainly on the proposed
convolution neural network and methods to adjust and arrange the input and output signals
in order to achieve maximum efficiency.

2.2.  The  Convolution  Neural  Network 

Based  on  the  pre-scan  methods,  we  set  the  computer 
programs  to  a  highly  sensitive  level  to  extract  possible 
objects  which  included  all  true-positive  detections  as 
well  as  a  large  number  of  false-positive  detections. 
Differentiation  of  false  positives  from  true  positives  is 
the  remaining  issue.  We  propose  to  use  the  trained 
convolution  neural  network  (CNN)  as  the  final 
classifier  to  carefully  study  each  suspect  area  in  the 
second  phase  of  the  diagnostic  process.  The  proposed 
CNN  can  be  considered  a  simplified  vision  machine 
designed  to  perform  the  second  part  of  the  disease 
detection  study  for  the  classification  into  disease  and 
non-disease.  This  neural  network  is  based  on  the 
network  structure  of  neocognitron  (Fukushima  et  al., 
1983)  which  is  designed  to  simulate  the  vision  of 
vertebrate  animals.  We  believe  that  the  CNN  should 
be  suitable  for  general  medical  image  pattern 
recognition. 

Before  entering  an  input  matrix  into  the  neural 
network,  we  employed  a  background  reduction 
method  (see  Section  2.3.1)  to  mimic  the  function  of 
“window  and  level”.  This  image  function  has  been 
widely  used  in  clinical  workstations.  It  is  utilized  to 
adjust  the  overall  brightness  of  the  image  so  that 
nodules  of  the  same  size  would  have  similar  contrast 
when  compared  to  the  background.  In  a  way,  it  can 
help  minimize  the  contrast  variation  of  the  disease 
pattern.  In  this  study  the  background  of  all  the 
suspected  image  blocks  was  reduced  for  the  CNN 
training  and  testing.  The  purpose  of  using  the  two- 
dimensional  convolution  operation  is  to  simulate 
radiologists’  viewing  of  a  suspected  area.  In  other 
words,  we  instructed  the  neural  net  to  utilize  the 
information  on  both  the  center  of  the  image  block 
and  its  neighborhood  and  to  train  the  neural  network 
to  extract  necessary  local  features  through  the 
supervised  backpropagation  training.  On  the  output 
side,  we  tried  to  educate  the  neural  net  by  simulating 
the  radiologists’  decision  making  process.  To  model 
radiologists’  interpretation  of  an  image  area  with  a 
certain  probability  of  abnormality,  a  method  of 
utilizing  fuzzy  output  association  is  proposed  in 
Section  2.3.3  to  turn  this  kind  of  clinical  measure  into 
information  readable  by  a  neural  network. 

2.2.1. The Structure of the Proposed Convolution Neural Network. The CNN is a simplified
version of the neocognitron. Since there is no theory to indicate what is the best neural
network structure for medical image pattern recognition, we started our studies by using
two-level and three-level neocognitron structures. We did not use complex layers and did
not extend our study beyond a three-level structure due to computation constraints. Nets
between two adjacent levels (layers) are selectively interconnected across groups. The
forward propagation algorithm was developed for non-supervised training in the original
neocognitron method. The supervised training from one layer to the next (i.e., layer
training), also proposed by Fukushima and Wake (1991) for handwriting recognition, may be
applied to the classification of disease patterns but will not be discussed in this paper.
Instead we used a convolution constrained neural network with the well known
backpropagation method for training. Figure 1 shows the global three-level structure of
this neural network.

We group the operations of kernels and the image block in such a way that the center of the
suspected nodule area is separated from the surrounding region. Basically each group in the
receiving layer receives signals from two groups of weights (i.e., kernels). The kernels
operating in the surrounding areas are referred to as peripheral kernels and the kernels
operating in the central areas as inner kernels. This arrangement is specifically designed
for image blocks containing a suspected tumor. In such a case, the bright spot is located
approximately in the central area indicated by the pre-scan procedures. The purpose of
using the dual-kernel is to instruct the peripheral and the inner kernels to learn
different image patterns. However, for those tasks not involving the recognition of round
objects, the use of a single kernel is recommended. In the following experiment, we use the
dual-kernel and the single kernel for detection of lung nodules and microcalcifications,
respectively. For the forward signal propagation, the results of convolving the weighting
factors of the kernel with the element values of the front layer are collected into the
corresponding matrix elements of the receiving layer. This operation accounts for the major
difference between the convolution type neural network and a regular fully connected neural
network. The collected value at each element is then passed through a sigmoid function in
the forward propagation, as in an ordinary forward propagation neural network system.
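
The forward signal propagation described above can be sketched for a single kernel as follows. This is a hedged illustration rather than the authors' implementation: the names `sigmoid` and `conv_forward` and the `bias` term are assumptions, the dual-kernel split into inner and peripheral regions is omitted, and the kernel is applied as a sliding weighted sum without flipping (for learned weights the distinction from a strict convolution is immaterial).

```python
import numpy as np


def sigmoid(x):
    """Logistic activation applied to each collected value."""
    return 1.0 / (1.0 + np.exp(-x))


def conv_forward(front, kernel, bias=0.0):
    """Propagate one front-layer group through one kernel.

    Each receiving element collects the weighted sum of the front-layer
    elements under the kernel, then passes it through a sigmoid.
    Only positions where the kernel fits entirely are computed, so a
    16 x 16 input and a 5 x 5 kernel give a 12 x 12 output.
    """
    front = np.asarray(front, dtype=np.float64)
    kernel = np.asarray(kernel, dtype=np.float64)
    kh, kw = kernel.shape
    out_h = front.shape[0] - kh + 1
    out_w = front.shape[1] - kw + 1
    collected = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = front[i:i + kh, j:j + kw]
            collected[i, j] = np.sum(patch * kernel) + bias
    return sigmoid(collected)
```

Because the same kernel weights are reused at every position, the number of free parameters per group is the kernel size rather than the product of the two layer sizes, which is the main contrast with a fully connected network.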

FIGURE 1. An artificial convolution neural network with dual-kernel for lung nodule detection.

Each suspected image block of 32 × 32 pixels indicated in the pre-scan program is extracted
as an object for CNN classification. Because of the long training time of the CNN
algorithm, every four pixels in a 2 × 2 square were averaged into one pixel so that each
image block was reduced to 16 × 16 pixels. We used an array size of 5 × 5 for both inner
and peripheral kernels between layers. The first hidden layer consists of 12 groups. Each
group has 12 × 12 pixels formatted in a square array, where the center 8 × 8 pixels and the
outer area, two pixels wide along each side, are contributed by the inner and peripheral
kernels, respectively. The second hidden layer also consists of 12 groups. Each group has
8 × 8 pixels, where the center 6 × 6 area and the outer area, only one pixel wide along
each side, are contributed by the corresponding inner and peripheral kernels, respectively.
The output layer has 10 nodes (groups) which are fully connected to the second hidden
layer. However, in the experiment involving the detection of microcalcifications described
later, no peripheral kernel was used. This is because most microcalcifications are
concentrated in regions of only a few pixels when a digitization pixel size of 105 μm is
used.
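
The block reduction and the layer sizes quoted above can be checked with a short sketch. The helper names `average_2x2` and `valid_size` are hypothetical, and the sketch assumes that each kernel is applied only where it fits entirely inside the front layer, which is consistent with the 16, 12 and 8 pixel sizes in the text.

```python
import numpy as np


def average_2x2(block):
    """Average every 2 x 2 square into one pixel (32 x 32 -> 16 x 16).

    Assumes the block height and width are both even.
    """
    block = np.asarray(block, dtype=np.float64)
    h, w = block.shape
    return block.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


def valid_size(n, k):
    """Side length of the receiving group for an n-wide input and k-wide kernel."""
    return n - k + 1


block = np.arange(32 * 32, dtype=np.float64).reshape(32, 32)
reduced = average_2x2(block)            # input matrix for the CNN
first_hidden = valid_size(16, 5)        # side of each first hidden layer group
second_hidden = valid_size(first_hidden, 5)  # side of each second hidden layer group
```

With a 5 × 5 kernel each layer loses four pixels per side length, which reproduces the 12 × 12 and 8 × 8 group sizes stated in the paragraph.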

It is important to realize that the total number of nodes needed in the hidden layers
somewhat depends on the total number of training samples. Since we plan to expand our
database and the use of rotated versions of an input matrix, we expect that our training
samples will be very large in the future. The number of layers used should depend upon the
sophistication of the features that the neural network is intended to perceive. The more
complicated the disease patterns, the more layers are required to distinguish high order
information of image structures. The convolution kernels are organized in such a way as to
emphasize a number of image characteristics rather than those less correlated values
obtained from feature spaces for input. These characteristics are: (a) the horizontal
versus vertical information; (b) local versus non-local information; and (c) image
processing (filtering) versus signal propagation.

2.3. Image Processing and Training Methods

An appropriate neural network structure is an important working base to form a signal
propagation platform in a given recognition task. The training materials and methods,
which supply the information from which the network's knowledge is constructed, are
essential for the performance of the neural network. We believe that the success of using
the neural network relies not only on the network structure but also on sufficient training
information and effective training methods. This study demonstrates our approaches to
converting expert knowledge into computer readable information, which is the key issue in
terms of training. In this experiment, we provided the network with all available
radiological diagnostic information and set up the studies by adding one method at a time
to optimize the neural network performance.

2.3.1.  Background  Reduction  for  Suspected  Image 
Blocks.  We  found  that  the  consistency  of  input 
matrix  contrast  is  an  important  factor  in  stabilizing 
the  neural  network  learning.  In  our  experiment,  the 
neural  network  did  not  reach  a  solution  for  the  image 
blocks  provided  in  the  training  sets,  even  though  all 
suspected  nodule  blocks  were  corrected  for  back¬ 
ground  trend  and  relatively  centered  in  terms  of 
brightness.  Their  contrasts  (the  difierence  between 
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the  center  and  peripheral  brightness)  are  unevenly 
distributed  due  to  a  variation  of  X-ray  exposure  and 
different  sensitivity  of  the  films.  In  addition,  the 
image  block  may  involve  many  superimposed  back¬ 
ground  structures,  namely  vessels,  ribs,  and  the  heart. 
Separating  nodules  or  suspected  round  objects  from 
chest  structures  is  not  an  easy  task.  We  concluded 
that  elimination  of  some  background  information 
and  enhancement  of  contrast  information  are 
necessary  procedures  to  assist  the  neural  network  in 
the  recognition  of  disease  patterns.  We  designed  a 
background  subtraction  technique  to  simulate 
“window  and  level”  function,  which  is  clinically 
useful  for  enhancing  disease  patterns.  In  fact,  we  only 
used  “level”  function  and  ignored  “window” 
function.  A  fixed  “window”  function  may  distort 
and  mix  the  contrast  information  that  exists  between 
nodules and end-on vessels. For the “level” function,
the background value, calculated by averaging the outer
ring area, is uniformly subtracted from the gray values
in each image block.


$$
f_s(x,y) =
\begin{cases}
f(x,y) - B & \text{for } f(x,y) > B \text{ and } (x,y) \text{ inside the circle,} \\
0 & \text{for } f(x,y) \le B \text{ or } (x,y) \text{ on or outside the circle,}
\end{cases}
\tag{1}
$$

where

$$
B = \frac{1}{C} \sum_{n \,\in\, \text{the ring}} f(x_n, y_n)
$$

for circular object detection and C equals the number
of pixels in the ring. Figure 2 shows the heavily
shaded pixels that are used in the calculation for
background averaging. Both heavily and lightly
shaded areas are assigned a pixel value of 0. Our
studies indicated that this ring area averaging method
produces better results than the peripheral area
averaging method. This may be due to the fact that
the ring area is closer to the central area and
possesses greater background information than the
entire peripheral area.

FIGURE 2. A 32 × 32 image block. The white area is the area of
interest for image pattern recognition using the convolution neural
network. Original pixel values in the heavily shaded area are
averaged as a background value.
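As an illustration of eqn (1), the following minimal sketch subtracts a ring-averaged background from a square image block. The function name, the two radius parameters, and the choice of measuring distance from the geometric center of the block are assumptions made for this example; the text does not give the radii of the circle or the ring.

```python
def ring_background_subtract(block, circle_radius, ring_outer_radius):
    """Sketch of eqn (1) for a square block given as a list of rows.

    Assumption: the ring starts where the circle ends and extends to
    ring_outer_radius; both radii are hypothetical parameters.
    """
    size = len(block)
    center = (size - 1) / 2.0

    def radius(x, y):
        return ((x - center) ** 2 + (y - center) ** 2) ** 0.5

    # B: average of the C pixels that lie in the ring.
    ring = [block[y][x]
            for y in range(size) for x in range(size)
            if circle_radius <= radius(x, y) < ring_outer_radius]
    background = sum(ring) / len(ring)

    # Inside the circle keep f - B where f > B; everything else becomes 0.
    result = [[0.0] * size for _ in range(size)]
    for y in range(size):
        for x in range(size):
            if radius(x, y) < circle_radius and block[y][x] > background:
                result[y][x] = block[y][x] - background
    return result, background
```

Only the "level" is changed here: no gray-value window is applied, which matches the decision in the text to ignore the "window" function.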

For  the  detection  of  non-circular  objects,  the 
background  value  should  be  obtained  by  averaging 
pixel  values  on  the  frame.  Each  gray  value  from  the 
background-reduced  image  block  is  one-to-one 
transferred  to  a  node  of  the  input  layer  for  the 
neural  net  processing.  These  signals  received  at  the 
input  layer  are  equivalent  to  the  light  signals  received 
by  the  retina  as  far  as  the  vision  type  neural  network 
is  concerned. 


2.3.2. Backpropagation Training. The main difference
between  conventional  weights  and  kernel  weights  is 
that  conventional  weights  are  independent  and  kernel 
weights  are  constrained  by  grouping.  We  believe  that 
the  latter  method  is  more  powerful  than  the  former 
method  for  direct  image  pattern  recognition.  In 
addition,  the  trained  kernels  can  be  analyzed  to 
understand  what  features  were  learned  during  the 
training.  This  design  would  allow  researchers  to 
further  investigate  the  artificial  neural  network 
learning.  Training  requires  many  iterations  for  the 
network  to  obtain  solutions  for  all  weights  applied  to 
the  propagation  while  the  error  function  reaches  a 
minimum  value. 

By  looking  at  the  CNN  processing,  one  may  find 
that  signals  are  filtered  and  modulated  as  in  a 
complicated  circuit  system.  Signal  propagation  from 
one  layer  to  the  next  is  composed  of  a  two-step 
calculation:  (a)  adaptive  convolution  combiner  and 
(b)  an  activation  function  (a  sigmoid  function  is  used 
in  this  study)  which  is  given  below: 

$$
S_x((i,j); n) =
\frac{1}{1 + \exp\left\{ -\left[ \sum_{m} k_x((u,v); n, m) \otimes S_{x-1}((i,j); m) \right] \right\}}
\tag{2}
$$

or

$$
S_x((i,j); n) =
\frac{1}{1 + \exp\left\{ -\left[ \sum_{u,v;m} k_x((u,v); n, m) \times S_{x-1}((i-u, j-v); m) \right] \right\}},
\tag{3}
$$

where $S_x((i,j); n)$ represents the signal at node $(i, j)$,
$n$th group, and $x$ layer; $k_x((u,v); n, m)$ denotes the
weighting factor value of net $(u, v)$ in the $n$th group of
the $x - 1$ layer which connects the $m$th group of the $x$
layer.
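The two-step calculation of eqn (3) can be sketched as follows. This is only an illustrative sketch: the function names, the nested-list data layout, and the restriction to output positions where the kernel lies fully inside the input plane are assumptions of this example, not details given in the text.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def conv_layer_forward(prev_groups, kernels):
    """One layer of signal propagation in the form of eqn (3).

    prev_groups[m][i][j] : signals of group m in layer x-1
    kernels[n][m][u][v]  : kernel linking group m (layer x-1) to group n (layer x)
    Returns out[n][i][j] for the positions where the kernel fits entirely.
    """
    size = len(prev_groups[0])
    ksize = len(kernels[0][0])
    out_size = size - ksize + 1
    out = []
    for group_kernels in kernels:
        plane = [[0.0] * out_size for _ in range(out_size)]
        for i in range(out_size):
            for j in range(out_size):
                total = 0.0
                for m, kernel in enumerate(group_kernels):
                    for u in range(ksize):
                        for v in range(ksize):
                            # (i - u, j - v) of eqn (3), offset by ksize - 1
                            # so that the index stays inside the plane.
                            total += (kernel[u][v]
                                      * prev_groups[m][i + ksize - 1 - u][j + ksize - 1 - v])
                plane[i][j] = sigmoid(total)  # step (b): activation
        out.append(plane)
    return out
```

Because every node of a group shares the same kernel, the number of free weights depends on the kernel size and the number of groups rather than on the size of the image block.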

The  error  function  which  is  expected  to  reach  a 
local  minimum  through  the  error  backpropagation 
training  can  be  given  as: 


$$
E = \frac{1}{2} \sum_{n_o = 1}^{T} \left[ y(n_o) - S_o(n_o) \right]^2
\tag{4}
$$

where $y(n_o)$ and $S_o(n_o)$ are the target output and
calculated output signals for output node $n_o$,
respectively, and $T$ is the total number of output
nodes. Based on eqns (3) and (4), the iterative version
of  kernel  weights  derived  by  the  generalized  delta  rule 
is  given  as: 

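$$
\begin{aligned}
k_x((u,v); n, m)[t+1] ={}& k_x((u,v); n, m)[t] \\
&+ \eta \sum_{i,j} \delta_x((i,j); n)\, S_{x-1}((i-u, j-v); m) \\
&+ \alpha\, \Delta k_x((u,v); n, m)[t],
\end{aligned}
\tag{5}
$$

where $t$ is the iteration number during the training, $\eta$
is the gain for the current weight changes, $\alpha$ is the
gain for the momentum term received in the last
learning loop, and $\delta$ is the weight-update function
which is given as:

$$
\delta_x((i,j); n) = S_x((i,j); n)\,[1 - S_x((i,j); n)]\,Q_x((i,j); n)
\tag{6}
$$

and

$$
Q_x((i,j); n) = \sum_{u,v;m} k_{x+1}((u,v); n, m) \times \delta_{x+1}((i+u, j+v); m).
$$

For the output layer,

$$
\delta_o(n_o) = [S_o(n_o) - y(n_o)]\, S_o(n_o)\,[1 - S_o(n_o)],
\tag{7}
$$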

where  o  denotes  the  output  layer.  In  this  study,  all 
weighting  factors  including  the  kernels  were  initially 
given a normalized random number. The normalization
is based on the number of nets connecting to a
destination  node  in  the  next  layer. 
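The role of the two gains in the iterative update can be seen on a single weight. The sketch below is a generic momentum form of the generalized delta rule, written with the gradient of the error made explicit so that the sign convention is unambiguous; it is not the authors' implementation. The quadratic error, the function names, and the gain values (`eta = 0.1`, `alpha = 0.9`) are assumptions chosen only for illustration.

```python
def momentum_step(weight, gradient, prev_change, eta, alpha):
    """Return the updated weight and the change to reuse in the next loop.

    gradient is dE/dw; the weight moves against it, and alpha re-applies
    a fraction of the previous change (the momentum term).
    """
    change = -eta * gradient + alpha * prev_change
    return weight + change, change


def train_single_weight(start, target, eta=0.1, alpha=0.9, iterations=500):
    """Minimise the toy error E = 0.5 * (w - target)^2 with momentum."""
    weight, change = start, 0.0
    for _ in range(iterations):
        gradient = weight - target  # dE/dw for the toy error above
        weight, change = momentum_step(weight, gradient, change, eta, alpha)
    return weight
```

With these illustrative gains the weight oscillates around the minimum before settling, which is one reason an error function need not decrease at every iteration even when it decreases overall.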

2.3.3.  Neural  Network  Output  Assignment  Using 
Radiological  Diagnostic  Rating.  The  design  of  the 
output  layer  for  the  medical  diagnostic  decision  is  not 


as  straightforward  as  in  other  applications.  Our  goal 
is  to  distinguish  non-disease  patterns  from  disease 
patterns.  We  can  classify  the  data  set  in  two 
categories.  However,  this  is  probably  not  an  optimal 
design  for  the  output  layer.  In  some  obvious  cases, 
radiologists  are  able  to  make  a  clear  diagnostic 
indication of a disease shown on an image. Often they
work with different degrees of sensitivity (different
levels  of  suspicion)  depending  on  the  clinical 
situation.  Thus  they  estimate  the  likelihood  that  a 
radiograph  or  an  area  of  a  radiograph  may  possess  a 
disease.  For  the  neural  network,  it  may  be  more 
realistic  to  define  the  output  in  terms  of  probability. 
Depending  upon  the  number  of  output  nodes  used, 
the  arrangement  of  output  nodes  and  the  probability 
associated  with  a  score  varies.  Intuitively,  one  can 
proportionally  scale  scores  onto  node  numbers. 

Although  the  above  output  node  assignment 
follows  the  general  diagnostic  decision  rule  used  by 
many  radiologists,  one  output  node  has  no  relation  to 
any  other.  No  output  node  relation  will  be  passed  to 
the  neural  network  for  the  training.  To  circumvent 
this  problem,  we  propose  to  use  a  narrow  output 
distribution  to  establish  a  fuzzy  association  between 
the  adjacent  output  nodes.  In  fact,  when  a  radiologist 
determines  a  specific  probability  of  a  disease  pattern 
in  an  image  area  based  on  his/her  training  and 
experience,  this  probability  would  be  accompanied  by 
a  variation.  We  modelled  this  probability  with  a 
generalized  distribution  (Szepanski,  1980)  in  the  score 
space. 

G(σ, v, p) = [right-hand side not legible in the source]    (8)

where v is the distance from a given score.

The reason for modelling a generalized distribution is
that we do not know exactly what kind of
distribution can represent the radiologists’ interpretation
in various diseases. When p > 2, the distributions
may be too flat, which probably is not the case
with highly experienced pulmonary radiologists. For
simplicity, we use p = 2 (a Gaussian distribution) in
the experiment.

In addition to the distribution function, the trainer
can impose a driving function, r(v, s), onto it to
indicate the determination category to which the score
belongs, where s denotes the strength of the repulsion
introduced by the user. An example of the
trainer-imposed driving function for scores indicating
positive determination is given below:

$$
r(v, s) =
\begin{cases}
1 & \text{for } v \le 0 \\
s v + 1 & \text{for } v > 0.
\end{cases}
\tag{9}
$$

[Figure 3 labels: "Radiologist’s judgment D (scaled disease probability)"; "Output node d"; "boundary between true & false nodules for training".]

FIGURE 3. A fuzzy output association is constructed by Gaussian distribution and repulsive functions. (Note this drawing is not to scale.
Only one curve is used for a training case.)


For those scores associated with negative determination,
the repulsion of eqn (9) should be changed to
the opposite direction. Therefore, the output association
functions at a single node for a score indicating
an image block involving disease and for a score
indicating a disease-free image block are:

$$
A_h = K \times G(\sigma, v, p) \times r(v, s)
\tag{10}
$$

and

$$
A_l = K \times G(\sigma, v, p) \times r(-v, s),
\tag{11}
$$


respectively.  For  the  extreme  scores  in  the  score  space 
(i.e.,  minimum  and  maximum  scores),  the  use  of  a 
delta  function  is  recommended. 

In the lung nodule detection studies, we assigned
scores to all image blocks for training. Based on the
score, which corresponds to an output node, a
Gaussian distribution (p = 2) with a standard
deviation of σ = 0.55, an output scaling constant of
K = 2.5 and a repulsive strength of s = 1.5 for the
asymmetric output association (s = 0 for the symmetric
output association) were used for correlating
the adjacent scores. For the neural network output, we
used a discrete form of the score. We estimate that it
would take a great deal of computation time for 100
nodes or more in the output layer. Realistically, 10
discrete output nodes are proposed for the classification.
We assigned nodes 0-3 to correspond to definitely
negative through possibly negative detection; nodes 6-9
correspond to possibly positive through definitely positive
detection. Nodes 4 and 5 are not used, so as to provide a
decision buffer. During the experiment, we collected all the
suspected nodules in two categories (i.e., true nodule and
non-nodule). In the course of rating for the training set,
a senior radiologist scored each suspected nodule
based on his clinical knowledge. Pathologically proven
truth (whether or not a nodule is present) for each training
case was also provided to assist in the radiologist’s rating.

Figure 3 shows all the asymmetric output
association distributions corresponding to a
radiologist’s judgement. However, only one curve was
used for each judgement of a suspected image
block. Figure 3 also highlights the case in which score 7
is assigned. In this situation, output node 7 receives
the highest activation (1.0), node 8 receives the
second highest activation (0.5), node 6 receives some
activation (0.2), and the remaining nodes receive no
activation.

Two examples of output assignments associated
with probability in discrete form are given below:

(a) Symmetric output assignment:

$$
A(d, D) =
\begin{cases}
0.2 & \text{for } 9 > D > 5 \text{ and } d = D + 1, \text{ or for } 0 < D < 4 \text{ and } d = D - 1 \\
1.0 & \text{for } D = d \\
0.2 & \text{for } 9 > D > 5 \text{ and } d = D - 1, \text{ or for } 0 < D < 4 \text{ and } d = D + 1
\end{cases}
\tag{12}
$$

Otherwise A(d, D) = 0.

(b) Asymmetric output assignment:

$$
A(d, D) =
\begin{cases}
0.5 & \text{for } 9 > D > 5 \text{ and } d = D + 1, \text{ or for } 0 < D < 4 \text{ and } d = D - 1 \\
1.0 & \text{for } D = d \\
0.2 & \text{for } 9 > D > 5 \text{ and } d = D - 1, \text{ or for } 0 < D < 4 \text{ and } d = D + 1
\end{cases}
\tag{13}
$$

Otherwise A(d, D) = 0.

The asymmetric output assignment is intended to push
(train) the non-disease patterns toward low-score nodes
and the disease patterns toward high-score nodes. With
this output assignment for the output nodes in the
training, the relation between adjacent nodes is also
established. This supervised training can be generally
applied to any situation where association of outputs is
necessary.
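The discrete assignments of eqns (12) and (13) can be written as a small lookup for the 10-node output layer. The function name and the `asymmetric` flag are assumptions of this sketch; the inequalities are taken as printed, so scores 0, 4, 5 and 9 receive a single activated node.

```python
def output_assignment(score, asymmetric=True):
    """Target activations for the 10 output nodes given a rating D = score.

    The neighbour pointing away from the true/false boundary gets 0.5 in
    the asymmetric case (eqn (13)) and 0.2 in the symmetric case (eqn (12));
    the neighbour pointing toward the boundary gets 0.2 in both.
    """
    away = 0.5 if asymmetric else 0.2
    toward = 0.2
    target = [0.0] * 10
    target[score] = 1.0
    if 5 < score < 9:      # positive side: push toward higher nodes
        target[score + 1] = away
        target[score - 1] = toward
    elif 0 < score < 4:    # negative side: push toward lower nodes
        target[score - 1] = away
        target[score + 1] = toward
    return target
```

For a rating of 7 this gives 1.0 at node 7, 0.5 at node 8 and 0.2 at node 6, the case highlighted in Figure 3.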


2.4. Classification Invariance of Matrix Operations

The  use  of  moment  invariance  via  rotation  and  shift 
has  been  proposed  for  applications  in  graphic  pattern 
recognition.  The  direct  use  of  this  method  as  a 
classifier  may  not  be  suitable  for  those  image  patterns 
possessing  circular  symmetric  property  (e.g.,  nodule) 
or lacking a fixed geometric pattern (e.g., calcification).

Often medical image pattern recognition does not
use orientation (“top-down” or “left-right”) as a
classification criterion. In such a case we can take
advantage of this characteristic as an invariance. In
other words, we propose to rotate and/or to shift the
input vector and maintain the same output assignments
for the training. This method may affect the neural
network in two ways: (a) by instructing the neural
network that the rotated and shifted input vectors
should receive the same classification result; and (b) by
increasing the total number of training samples, which
is expected to enhance the performance of the neural
network.

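Using the center pixel as the origin, a standard
rotation and shift of the image block was used:

$$
\begin{bmatrix} x' \\ y' \end{bmatrix} =
\begin{bmatrix} \cos(\phi) & \sin(\phi) \\ -\sin(\phi) & \cos(\phi) \end{bmatrix}
\begin{bmatrix} x \\ y \end{bmatrix} +
\begin{bmatrix} \Delta x \\ \Delta y \end{bmatrix}
\tag{14}
$$

where $\phi$ is the rotation angle about the origin (center of
a 32 × 32 image block); $\Delta x$ and $\Delta y$ are shifts in the x
and y directions, respectively.

In this work, we only rotated each input matrix
eight times to test our hypothesis. Four of the
rotations are:

$$
\{0^\circ, 90^\circ, 180^\circ, 270^\circ\}.
\tag{15}
$$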

We  also  flipped  over  (left-right)  the  original  image 
matrix  and  used  the  above  rotations  again  to  obtain 
four  additional  rotations.  This  type  of  rotation  would 
only  reposition  pixel  values.  No  interpolation 
calculation  of  pixel  values  was  involved.  We  believe 
that  other  rotations  and  minor  shifts  are  also  valid 
methods  for  the  use  of  classification  invariance  of 
operations.  Rotation  may  require  interpolation 
which  would  slightly  alter  the  pixel  values  and 
should  be  acceptable  for  the  input  of  the  CNN. 


However,  the  use  of  shifting  can  be  complicated, 
because  it  involves  (a)  how  important  the  center 
information  for  disease  patterns  are  in  the  neural 
network  learning  and  (b)  how  much  shifting  can  be 
used  without  sacrificing  critical  portions  of  image 
information. 
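The eight interpolation-free orientations (four right-angle rotations of the original block and four of its left-right mirror image) can be generated by repositioning pixels only. The helper names in this sketch are assumptions made for illustration.

```python
def rotate90(block):
    """Rotate a square block 90 degrees clockwise by repositioning pixels."""
    return [list(row) for row in zip(*block[::-1])]


def flip_left_right(block):
    return [row[::-1] for row in block]


def eight_orientations(block):
    """Return the original, its three rotations, and the same four views
    of the left-right flipped block. No pixel value is interpolated."""
    views = []
    for start in (block, flip_left_right(block)):
        current = [list(row) for row in start]
        for _ in range(4):
            views.append(current)
            current = rotate90(current)
    return views
```

All eight views of a block would be paired with the same output assignment during training, which is how the invariance is conveyed to the network.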


2.5.  Classification  of  Output  Values  in  the  Testing 

We  assigned  scores  with  a  narrow  asymmetric  peak 
distribution  on  the  output  nodes  for  the  training  in 
order  to  associate  each  node  with  its  adjacent  node. 
We  believe  that  the  distribution  assignment  is  not  a 
unique  method  to  link  rating  score  relations. 
However,  the  output  relation  information  must  be 
passed  to  the  neural  network  for  learning.  This 
relationship does not exist in recognition for
characters or Arabic numbers. In those applications,
each node is independent from others.

After the training, a typical output pattern will be
very close to the corresponding perfect pattern (the
assigned narrow asymmetric peak distribution) for
most of the training cases. In testing, many cases
produce different output signal patterns. It is not a
simple task to interpret an output pattern if the testing
output does not follow an output pattern assigned in
the training. Corresponding to the grading system
arranged in the training, a polarized (linearly
weighted) function is given as an indication. With
this we can define a normalized disease detection
index (NDDI) for the judgement of a suspected area:

$$
\mathrm{NDDI} =
\frac{\sum_{n=0}^{N-1} \left[ O_n \times \left( n - (N-1)/2 \right) \right]}
     {\sum_{n=0}^{N-1} \left[ O_n \right] \times (N-1)/2}
\tag{16}
$$

where $n$ denotes the node in the output layer, $O_n$ is
the output value at node $n$, and $N$ is the total number
of output nodes. Hence a detection index of $-1$ or
near $-1$ indicates a definite non-nodule and a
detection index of 1 or near 1 implies a definite
nodule in the judgement of the neural
network. The reason for the weighting is that the
score line is centered at $(N - 1)/2$ (i.e., 4.5 for 10
nodes in the output layer) and polarization of true
and false depends on the position of the nodes.
Equation (16) is the net effect of all output nodes.
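Equation (16) reduces to a few lines of code. The function name below is an assumption of this sketch; the outputs are assumed to be non-negative with a non-zero sum, as is the case for sigmoid output nodes.

```python
def nddi(outputs):
    """Normalized disease detection index of eqn (16).

    outputs[n] is the value O_n at output node n, n = 0 .. N-1.
    """
    n_nodes = len(outputs)
    half = (n_nodes - 1) / 2.0            # center of the score line
    weighted = sum(o * (n - half) for n, o in enumerate(outputs))
    return weighted / (sum(outputs) * half)
```

As a worked case, an output pattern equal to the asymmetric training target for score 7 (0.2, 1.0 and 0.5 at nodes 6, 7 and 8) gives 4.55 / 7.65, roughly 0.59, while all activation at node 0 or node 9 gives -1 or 1 respectively.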

We do not recommend using a detection index of 0
as a cut-off point to determine a disease or a
disease-free image block using the trained neural network.
The cut-off point may be shifted by the inevitable bias
in the training cases. In practice the cut-off point is
established from many clinical cases in a rigorous
evaluation study. In this paper a pre-clinical
performance study was conducted and is discussed
below.

2.6.  Performance  Evaluation  of  The  Convolution 
Neural  Network 

Receiver operating characteristic (ROC) analysis is an
analytical method generally applied to the performance
evaluation of a system. In an ROC analysis,
the distributions of the normal and abnormal cases
may be represented by binormal distributions (Swets
& Pickett, 1982). When the two distributions overlap
on the decision axis, a cut-off point can be made at an
arbitrary decision threshold. The corresponding
true-positive fraction (TPF) versus false-positive fraction
(FPF) for each threshold can be plotted in
Cartesian coordinates. By marking several points on
the plot, curve fitting can be employed to construct
an ROC curve. The area under the curve, referred to
as Az, can be read as a performance index of the
system using ROC analysis. In general the higher the
Az, the better the performance. A computer program
(LABROC) using two sets of data, one for true and
the other one for false categories, is employed for the
analysis of the NDDIs derived from neural net
outputs.
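To make the TPF/FPF construction concrete, the sketch below sweeps a threshold over two lists of index values and integrates the resulting empirical curve with the trapezoidal rule. This is a simpler stand-in for illustration, not the binormal curve fitting performed by LABROC, so its area will generally differ from a fitted Az. The function names are assumptions.

```python
def roc_points(true_scores, false_scores):
    """Empirical (FPF, TPF) pairs, one per distinct decision threshold."""
    thresholds = sorted(set(true_scores) | set(false_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpf = sum(s >= t for s in true_scores) / len(true_scores)
        fpf = sum(s >= t for s in false_scores) / len(false_scores)
        points.append((fpf, tpf))
    return points


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

Here `true_scores` and `false_scores` play the role of the two data sets (true and false categories) mentioned above, for example the NDDI values of nodule and non-nodule blocks.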

3.  EXPERIMENTS  AND  RESULTS 

3.1.  Detection  of  Lung  Nodules  on  Digital  Chest 
Radiographs 

Chest radiographs of patients with primary and
metastatic cancer and with one or several lung
nodules were converted into digital form using a laser
film digitizer (Konica Laser Film Scanner Model:
KDFR-S; Tokyo, Japan). About one third of the chest
images were acquired from a computed radiographic
system (AGFA ADC prototype computed radiography;
Mortsel, Belgium). The digital data were
transmitted and stored in our PACS until needed
for the research project. The images were then
retrieved to a high speed workstation and the
following computer search steps were applied
sequentially: a thresholding evaluation, use of
background reduction, a test of profile matching rate,
and neural network classification.

The pre-scan process was performed first to locate
the center of the island and isolate the image block
for training. The pre-scan program was run in a
highly sensitive mode with a matching rate (MR) of
0.7 for all images involved in the training. Suspected
image blocks included various types of rib crossing,
and various sizes of end-on vessels and vessel clusters.
The true-positive nodules may also overlap with lung,
vessels, and rib structures. Figure 4 shows some
randomly sampled suspected image blocks which
were background-reduced and contrast-balanced for
display purposes. These image blocks were mirrored
and rotated 90°, 180°, and 270° for the training. Note
that each original and its seven “brother” image
blocks share the same score vector (probability of a
disease and output fuzzy association). During the
training, the original and its seven “brother” image
blocks as a group were entered in the same sequence.

During the training we found that the error
function did not decrease monotonically at each
learning epoch. However, the overall error decreased
over many iterations. We did not completely
retrain the kernels for different assignments in the
output layer. Instead, based on the trained kernels we
continued to train the neural network with additional
conditions. The sequence is: (a) CNN with symmetric
output association, (b) use of trainer-imposed driving
function, and (c) rendering seven “brother” images
for training.

FIGURE 4. The upper four rows show 64 nodule blocks sampled from the database. Each image block on rows 5 and 6 contains no
nodule but lung or rib structure. Each image block on the bottom two rows contains an end-on vessel.

[Figure 5 axis label: False-Positive Fraction]

FIGURE 5. Three ROC curves representing the performance of
(a) CNN without output association and rotation (plain line, the
average Az = 0.77), (b) CNN using a Gaussian output association
in the training (dashed line, the average Az = 0.83), and (c)
CNN with asymmetric output association and eight rotated input
matrices (bold line, the average Az = 0.88).

The database had 55 chest radiographs, of which only
25 contained at least one nodule. In the pre-scan,
52 nodules and 155 non-nodules were extracted
from all 55 images. All cases were confirmed by
biopsy or by follow-up showing growth of the
nodule. In this study, we employed a grouped
jackknife method (Fukunaga & Hayes, 1989) to
evaluate the performance of the CNN. We randomly
selected 28 images for training and the other 27
images for testing in the study. Final ROC curves
were obtained by averaging the results from 30
grouped jackknife experiments. The results obtained
from the tests were very encouraging. Figure 5 shows
the improvement obtained by using a convolution neural
network and the corresponding enhancement techniques,
namely output association and classification invariance
of matrix operation for the input.
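The image-level splitting used in the grouped jackknife experiments can be sketched as below. The function name, the use of integer image identifiers, and the fixed random seed are assumptions of this example; the key point it illustrates is that the split is made by image, so all blocks extracted from one radiograph fall on the same side.

```python
import random


def grouped_jackknife_splits(image_ids, n_train, n_trials, seed=0):
    """Return n_trials random (train_images, test_images) partitions."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    splits = []
    for _ in range(n_trials):
        shuffled = list(image_ids)
        rng.shuffle(shuffled)
        splits.append((shuffled[:n_train], shuffled[n_train:]))
    return splits


# 55 radiographs, 28 for training and 27 for testing, repeated 30 times.
splits = grouped_jackknife_splits(range(55), n_train=28, n_trials=30)
```

Each partition would be used to train and test the network once, and the resulting ROC results averaged over the 30 trials.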

In  this  experiment,  we  found  that  the  average  Az 
was  0.77  using  the  CNN  with  a  delta  function  for 
output  determination,  and  was  0.83  using  the  CNN 
with  a  narrow  Gaussian  distribution  for  output 
association.  Using  a  Gaussian  output  association 
and  eight  types  of  rotated  image  blocks  for  input,  we 
found  that  the  Az  was  increased  to  0.87.  After  a 
trainer  imposed  function  was  added,  we  obtained  an 
insignificant  increase  of  Az  to  0.88.  From  the  ROC 
curve  corresponding  to  Az  =  0.88,  we  found  that  the 
CNN  reduced  79%  of  false-positive  detections 
equivalent  to  2-3  false  nodule  detections  per  image 
and  preserved  80%  of  true-positive  detections. 

We  also  tested  the  same  database  using  two  nodes 
in  the  output  layer.  In  such  a  case,  no  output 
association  can  be  used.  The  CNN  achieved  an 
average  Az  of  0.83  when  eight  input  matrices  shared 
the  same  diagnostic  interpretation  (true  or  false). 

3.2.  Detection  of  Microcalcifications  on  Digital 
Mammograms 

We also evaluated the use of CNN in the detection of
subtle microcalcifications. A total of 68 mammograms
(only 38 of them contained subtle
microcalcifications) were digitized by a laser scanner
with a pixel size of 0.105 mm. The initial search prior
to the final interpretation by the neural network
follows the basic scheme which uses background
removal and signal extraction methods to pre-scan
the mammograms and to extract all possible
suspected areas (Chan et al., 1988, 1990, 1991).
After the pre-scan process by the computer program,
the 68 digital mammograms provided 265 true and
1821 false subtle microcalcifications. Figure 6 shows
some of the suspected regions which may or may not
contain microcalcifications.

FIGURE 6. Each image block, extracted from the mammogram, on the upper four rows contains at least one calcification. Each image
block on the bottom four rows contains at least a local maximum value of gray scale (bright spot) that is not a calcification. Each block at
matrix elements (1,4), (5,4), (7,4), (9,4), and (2,6) contains a bright spot due to a film defect.

Prior to the CNN process, the background of all
the image blocks was removed using a wavelet
high-pass filtering technique instead of the circular
averaging method described in Section 2.3.1, where
lung nodule detection was the objective. Specifically,
after extracting each suspected region from the
original digital mammogram, a three-level wavelet
transform was used and only the lowest frequency
component was eliminated for high-pass filtering before
image reconstruction. The high-pass filtered image blocks
were used as the input of the CNN. For this study, we
also employed the grouped jackknife method to
evaluate the performance of the CNN. We did not
ask radiologists to rate image blocks in the
mammography training set. Only two output nodes
with eight rotations for input were used. Neither
output association nor trainer-imposed function was
employed.

In the first study, we randomly selected two sets of
mammograms (i.e., 34 for training and 34 for testing)
with variable sizes of kernels and image block. Figure
7 shows the Az values of various CNN structures used in
the experiment with the same data set described
above. In this figure, the CNN structures are
indicated by Sn/Hm/Kt, representing n × n pixels for
input, m hidden layers, and a kernel size of t × t.
The image blocks are centered on a suspected
calcification indicated by the pre-scan method. This
study indicated that significantly higher Az values were
obtained when a square region of 1.7 mm (i.e., 16 × 16
pixels) for the input and a kernel size of 0.52 mm
(i.e., 5 × 5 pixels) were used. In addition, the use of
two hidden layers is better than the use of one hidden
layer. We also found that the best results are obtained
at a relatively large square error (i.e., cost function
was 40-70 for 2104 cases) which suggests a fuzzy
membership in the output or that more nodes in the
output layer may be necessary for the optimization of
the CNN in the detection of this database.

[Figure 7 axis label: Total Output Squared Error in the CNN-BP Training]

FIGURE 7. Az values in the detection of clustered microcalcifications
using different CNN parameters.

FIGURE 8. Two ROC curves representing the performance of (a)
CNN using two outputs and eight types of rotation for input with
the determination based on individual microcalcifications: the
average Az = 0.89 and (b) CNN using two outputs and eight
types of rotation for input with the determination based on
clustered microcalcifications: the average Az = 0.97.

Based on the above initial study, we decided
to use the CNN structure with the parameter of
S16/H2/K5 for the grouped jackknife study of
the CNN performance. Final ROC curves were
obtained by averaging the results from 30 grouped
jackknife experiments. Figure 8 shows the results
of using the CNN and classification invariance of
matrix operation for the input. In this experiment,
the average Az was 0.89 when the determination
was based on individual microcalcifications,
and was improved to 0.97 when the determination
was based on the clustered microcalcifications. In
the latter method, suspected clusters including
one or two calcifications were rejected and the
average NDDI taken from the clustered calcifications
was used for the ROC evaluation. One must
realize that the detection of clustered microcalcifications
is more clinically significant than that of individual
calcifications, since the clustered microcalcifications
(three or more) are a strong indication of breast
carcinoma in radiological diagnosis. The clustering
procedure was done by grouping the detected
microcalcifications in a 1 cm² region of the mammogram.
Only a minimum of three clustered
microcalcifications was considered a detection. The
average ROC curve for the detection of clustered
microcalcifications indicated that the CNN can
eliminate 90% of false-positive detections, resulting
in 0.5 false clustered detection per image, and
preserve a true-positive detection rate of 87%.
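One simple way to realize the cluster-level rule (at least three detections within a 1 cm² region, scored by their average NDDI) is sketched below. The text does not say how the 1 cm² regions are positioned, so the fixed grid of 10 mm × 10 mm cells used here is an assumption, as are the function name and the (x, y, NDDI) tuple layout with coordinates in millimetres.

```python
def cluster_detections(detections, cell_mm=10.0, min_count=3):
    """Group detections into grid cells and score the qualifying cells.

    detections : iterable of (x_mm, y_mm, nddi) for individual detections
    Returns {cell_index: mean NDDI} for cells holding at least min_count
    detections; cells with one or two detections are rejected.
    """
    cells = {}
    for x_mm, y_mm, score in detections:
        key = (int(x_mm // cell_mm), int(y_mm // cell_mm))
        cells.setdefault(key, []).append(score)
    return {key: sum(scores) / len(scores)
            for key, scores in cells.items()
            if len(scores) >= min_count}
```

With a pixel size of 0.105 mm, a 10 mm cell side corresponds to about 95 pixels, so pixel coordinates can be converted to millimetres before calling the function.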


4.  DISCUSSION 

Medical image pattern recognition using feature
extraction as an input has been proposed in the
detection of disease patterns ([author name illegible]
et al., 1991). Since only a small number of inputs are
used (as compared to 16 × 16 input signals), less
computation is necessary for training. As long as the
features of a disease pattern are well defined and can
be quantified as values or vectors, a nonconvolution
neural network should be able to classify the features.
However, both the proposed diagnosis invariance
and the output assignment methods for the enhancement
of disease detections may only be used in
limited cases. On the other hand, the structure of the
convolution neural network is complicated and
requires more computations, particularly for the
training. The CNN does not require the feature
extraction of disease patterns from the image and is
capable of distinguishing non-disease patterns from
disease patterns. A potential advantage of using the
proposed CNN is that feature extraction can be more
specifically defined not only by the user’s experience
but also by the confirmation of the CNN when the
function of each kernel is discovered. Some
complementary features learned by the CNN may be able
to contribute image information of a disease, thereby
assisting the radiologist in better understanding all
the features of the disease. Further investigation of
the CNN specified features, other than known
features, should be very interesting to radiologists
and imaging scientists.

In this work we used preliminary scanning
methods to define suspected abnormal areas. The
final disease classification was analyzed by using an
artificial convolution neural network with
backpropagation training. We proposed several methods
to mimic the radiologists’ reading patterns in
detecting diseases on radiographs. Though conventional
image processing techniques can capture true
diseases, many false-positive detections are obtained.
We found that the CNN substantially reduced the
number of false-positive detections.

In  this  study,  we  designed  the  convolution  neural 
network  to  focus  on  local  information  with  expert- 
trained  output  distribution.  The  use  of  diagnosis 
invariance  of  rotation  seems  likely  to  enhance  the 
performance  of  the  CNN  by  virtually  increasing  the 
number of training cases. Both the
expert-trained output distribution and the classification
invariance of matrix operations are
applicable not only to the CNN but also to a conventional neural
network, as long as an image (or an image-associated
vector which depends on image orientation) is used in
the input layer.
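
The rotation invariance described above can be illustrated with a minimal sketch. It assumes that the group of eight image blocks (the original plus seven variants) is formed from the four 90-degree rotations and their mirror images; the function names `rotate90`, `flip_lr`, and `brother_group` are hypothetical and are not taken from the original program.

```python
def rotate90(block):
    """Rotate a square block (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*block[::-1])]


def flip_lr(block):
    """Mirror a block left to right."""
    return [row[::-1] for row in block]


def brother_group(block):
    """Return the original block plus seven rotated/mirrored variants.

    Assumption: the eight orientations are the four rotations and
    the left-right mirror image of each rotation.
    """
    group = []
    current = [list(row) for row in block]
    for _ in range(4):
        group.append(current)
        group.append(flip_lr(current))
        current = rotate90(current)
    return group


block = [[1, 2], [3, 4]]
group = brother_group(block)  # all eight blocks share one diagnosis label
```

Each block in `group` would be presented to the network with the same teaching signal, which is how the number of training cases is virtually increased.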

Summarizing  the  failure  cases  in  the  study  of 
lung  nodule  detection,  we  found  that  the  majority 
of false-negatives related to nodules partially overlapped
with rib and many false-positives related to
end-on vessels. This is because our training
database  was  small  and  did  not  have  enough  true 
cases  to  cover  various  situations  in  rib  overlapping 
on  nodules  and  did  not  have  enough  false  cases  to 
cover  various  contrasts  of  end-on  vessels.  We 
believe  that  the  performance  of  the  CNN  will  be 
greatly  improved  when  the  training  cases  are 
sufficiently expanded in future studies.

One  may  interpret  eqn  (16)  as  another  network 
fully  connected  to  a  single  output  node.  This 
subnetwork  can  be  included  in  the  backpropagation 
training  with  a  linear  activation  function  for  the 
output  node.  However,  this  subnetwork  does  not 
ensure  that  the  backpropagated  signals  on  the 
previous  layer  (i.e.,  the  output  layer  consisting  of  10 
nodes)  are  matched  with  the  radiologists’  scores.  The 
use  of  10  nodes  in  the  output  layer  also  provides 
flexibility  for  the  researcher  to  investigate  the 
migration  of  kernel  changes  when  an  additional 
training  strategy  is  added.  The  fuzzification  of  the 
teaching  signals  and  the  use  of  a  trainer  imposed 
function  are  examples  of  the  training  strategies  used 
in  this  paper.  The  kernel  changes  corresponding  to 
the  training  can  be  important  information  for  future 
optimization  of  the  CNN  algorithm  associated  with 
disease  pattern  recognition. 

In this study, we learned that background
reduction was a necessary procedure for the detection
of both lung nodules and mammographic microcalcifications;
otherwise the error function would not
reach a minimum for the training data set. Several
broad output distributions were also tested. The
CNN performance (i.e., generalization) in those tests
was inferior to that with the narrow output distribution.
A comparison experiment was also conducted to
evaluate  the  difference  between  the  training  using 
image  groups  (the  original  and  its  seven  “brother” 
image  blocks  as  one  group)  and  image  blocks  (all 
image  blocks).  We  found  that  the  CNN  seems  to 
perform  better  using  image  groups  than  randomizing 
each  image  block  in  the  training.  We  also  modified 
our  neural  network  structure  to  one  hidden  layer.  The 
CNN  performance  with  one  hidden  layer  was  not  as 
effective (the average Az = 0.81 for a kernel size of
5 x 5 and the average Az = 0.85 for a kernel size of
13 x 13) as when two layers were used for the
detection of microcalcifications. However, the performance
was about the same with one hidden layer
and  two  hidden  layers  for  the  study  involving  lung 
nodules.  We  do  not  know  whether  this  effect  was  due 
to  the  fine  structure  of  microcalcifications  or  smaller 
samples used in the experiment of lung nodule
detection. We are currently testing three hidden
layers to see if there is any improvement in the
generalization. We are also working on 32 x 32
original image blocks and expect better outcomes.
With the larger image block and kernel sizes, the
computation time will increase to 10-16 times
the training time needed for the CNN configuration
described in Section 2.2.1. When the database is
expanded, a more powerful computer will be required
for the training.

5.  CONCLUSIONS 

In this study, we have added several effective
techniques to the convolution neural network for
the enhancement of disease diagnosis: (a) development
of a better background reduction method so
that the neural network has a better “observation” of
the image block, (b) providing the radiologists’ rating
scale for the backpropagation training, (c) introducing
the neural network to the classification
invariance of input matrix operations, (d) use of
output association functions to mimic the radiologists’
interpretation and to establish the relationship
between adjacent output nodes, and (e) rendering
trainer-imposed functions to enhance the performance
of the CNN. We found that the performance of
the CNN in detecting disease was improved
significantly by administering these training methods.

Studies  in  the  use  of  chest  radiographs  for  the 
detection  of  lung  nodules  (Stitik  et  al.,  1985;  Hellan  et 
al.,  1984)  have  demonstrated  that  even  with  highly 
skilled  and  highly  motivated  radiologists  working 
with  high  quality  chest  radiographs,  only  68%  of  all 
retrospectively  detected  lung  cancers  were  detected 
prospectively  when  read  by  one  reader,  and  only  82% 
were detected by two readers. Our studies did not
have the same clinical setting as Stitik and Hellan’s
because of our smaller database; therefore, we could not
compare our results with the radiologists’ sensitivity
of 68% mentioned above. However, we consider it
likely  that  radiologists  will  benefit  from  the  use  of  a 
nodule  detection  program  such  as  this  in  one  of  two 
ways.  First,  the  radiologist  will  use  the  program  as  a 
second  reader,  thus  increasing  the  detection  of  lung 
nodules  similar  to  the  results  seen  in  the  study  by 
Stitik.  In  the  second  method,  the  radiologist  may  call 
on  the  system  as  a  consultant  on  an  individual 
suspected  area.  The  radiologist  can  point  to  the 
suspected  area  and  ask  for  the  interpretation  from  the 
CNN  system.  The  CNN  system,  in  fact,  may  be  able 
to  work  as  a  trained  referral  system  for  the 
consultation in detecting lung nodules. Such a
program is also readily available on our computer,
and clinical evaluation is in progress. A fully
automatic lung nodule detection program takes
12-18 s for a 512 x 512 digital chest radiograph on a
DEC Alpha workstation. To evaluate an identified
area, it only takes the CNN program 0.2 s to respond.

This  work  has  demonstrated  two  successful 
medical  diagnostic  applications  using  an  artificial 
visual  neural  network  and  expert-trained  computer 
procedures instead of a non-convolution neural
network or other conventional classification method.
This technique attempted to simulate the
radiologists’ reading pattern: pre-screening and classification
for interpretation. We believe that the
proposed  convolution  neural  network  and  its 
associated  training  techniques  can  be  extended  to 
many  diagnostic  imaging  areas  such  as  the  detection 
of  low  contrast  mass  in  mammography  and  the 
pattern  recognition  of  interstitial  lung  disease  in  chest 
radiography.  In  fact,  the  proposed  CNN  technique 
should  be  able  to  be  trained  to  detect  almost  all 
disease  patterns  perceivable  by  a  trained  radiologist. 
ABSTRACT

A  neural  network  based  framework  has  been  developed  to  search  for  an  optimal  wavelet 
kernel  that  is  most  suitable  for  a  specific  image  processing  task.  In  this  paper,  we  demonstrate  that 
only  the  low-pass  filter,  hu,  is  needed  for  orthonormal  wavelet  decomposition.  A  convolution  neural 
network  can  be  trained  to  obtain  a  wavelet  that  minimizes  errors  and  maximizes  compression 
efficiency  for  an  image  or  a  defined  image  pattern  such  as  microcalcifications  on  mammograms.  We 
have  used  this  method  to  evaluate  the  performance  of  tap-4  orthonormal  wavelets  on  mammograms, 
CTs, MRIs, and the Lena image. We found that Daubechies' wavelet (or those wavelets possessing
similar filtering characteristics) produces satisfactory compression efficiency with the smallest error
using a global measure (e.g., mean-square-error). However, we found that Haar's wavelet produces
the best results on sharp edges and low-noise smooth areas. We also found that a special wavelet
whose low-pass filter coefficients are (0.32252136, 0.85258927, 0.38458542, -0.14548269) can
greatly  preserve  the  microcalcification  features  such  as  signal-to-noise  ratio  during  a  course  of 
compression.  Several  interesting  wavelet  filters  (i.e.,  the  g  filters)  were  reviewed  and  explanations  of 
the  results  are  provided.  We  believe  that  this  newly  developed  optimization  method  can  be 
generalized  to  other  image  analysis  applications  where  a  wavelet  decomposition  is  employed. 


1.  Introduction 

In the field of transform coding, discrete cosine transform (DCT) based decomposition
methods were developed extensively in the 1970s and 1980s. Most of the techniques developed in this
area are associated with block DCT. However, several investigators indicated that the use of
full-frame DCT can produce high compression efficiency with high data fidelity and without blocky
artifacts. This method is particularly appropriate for high-resolution large-sized images. Recently,
sub-band and wavelet transformations have been widely used in image compression research.
Unlike DCT, there exist many discrete wavelet transform (DWT) filters that can perform data
decomposition.  This  paper  provides  a  neural  network  approach  to  search  for  an  optimal  wavelet  that 
minimizes  quantization  errors  and  at  the  same  time  produces  the  highest  compression  efficiency. 
This  method  can  also  be  extended  to  evaluate  various  wavelets  in  preserving  defined  image  features. 


2.  Algorithm  Development 

2.1. Construct a Neural Network using Wavelet Decomposition

The artificial neural network described in this paper is based on the convolution process used
in sub-band (including wavelet) decomposition. In fact, the wavelet-based neural network
performs exactly the same as the conventional wavelet transform. Our approach is to use the training
capability of the neural network to obtain the most suitable wavelet kernel for a specific signal
processing task. In this paper, our task is to minimize error and simultaneously achieve the highest
compression efficiency during the course of compression and decompression. In order to
match the sub-band decomposition, several characteristics of the neural network must be established: (a)
no hidden layer and only one output layer is used, (b) local connection through a convolution process rather than
fully connected nets is employed, and (c) the convolution process must be invertible (wavelet kernels
are used in this paper). During compression and decompression, the inversion is only
approximate. The approximation is not due to the inverse transformation but to the
inaccuracy of the quantized transform coefficients. Figure 1 shows the structure of the neural network
using quantized transform coefficients as the targets.


[Figure 1 diagram labels: wavelet kernel; quantization errors to train the neural network.]

Figure 1. A neural network based on wavelet decomposition and trained by quantization errors.
T(i,j) and QT(i,j) denote transform and quantized coefficients in high-frequency domains, respectively.


In  fact,  we  should  not  consider  only  the  issue  regarding  minimization  of  quantization  errors.  The 
minimization  of  entropy  must  also  be  taken  into  account  for  the  optimization.  We  combine  both  issues 
by  multiplying  the  mean-square-error  function  with  an  imposed  entropy  reduction  function.  The  cost 
(error)  function  for  training  the  neural  network  becomes 

$$E_f(i,j) = Z(QT(i,j)) \times \frac{[T(i,j) - QT(i,j)]^2}{2} \qquad (1)$$

where QT(i,j) is the quantized transform coefficient at pixel (i,j) and Z(QT(i,j)), which is the entropy
reduction function for a set of quantization coefficients, is given below:

$$Z(QT(i,j)) = \begin{cases} 0 & \text{for } QT(i,j) = 0 \\ 1 & \text{for } |QT(i,j)| = 1 \\ F(n,q) & \text{for } |QT(i,j)| = n \end{cases} \qquad (2)$$

F(n,q), which is a ramp function, is a function of the quantization factor, q, and is somewhat inversely
proportional to the quantized integer, n. The value of the ramp function should always be smaller than 1.

The reason for designing the entropy reduction function for a fixed quantizer, q, using Eq. (2) is
three-fold: (a) since most low-value coefficients (-0.5q < T(i,j) < 0.5q) are associated with noise when q
is not very large, there is no need to emit error from an output node possessing quantized value
0 to train the neural net; (b) the more low quantized values there are, the lower the ensemble entropy will
be; and (c) the probability of turning a high quantized value into a low quantized value is very low,
therefore errors backpropagated from high quantized values should be emphasized less than those from
low quantized values such as 1 or 2. When q is very small, the quantization error is in the range of global
image noise. In this case, the neural network will rely on the guidance of the Z function to search for a
wavelet filter that produces more low transform values. The success of this cost (error) function design
is demonstrated by our experiments shown in the Results section.
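
As an illustration of Eqs. (1) and (2), the following minimal sketch evaluates the cost for a single coefficient. Several details are assumptions rather than the authors' choices: the ramp is taken as F(n,q) = 1/n (the text only states that it is below 1 and roughly inversely proportional to n), the quantizer rounds to the nearest integer, and the residual is measured against the dequantized value q x QT. The names `ramp`, `entropy_reduction`, and `cost` are hypothetical.

```python
import math


def ramp(n, q):
    """Hypothetical ramp F(n, q): below 1 and decreasing in n (q unused here)."""
    return 1.0 / n


def entropy_reduction(qt, q):
    """Z(QT) of Eq. (2): 0 for zero, 1 for magnitude 1, F(n, q) otherwise."""
    n = abs(qt)
    if n == 0:
        return 0.0
    if n == 1:
        return 1.0
    return ramp(n, q)


def cost(t, q):
    """E_f of Eq. (1) for one transform coefficient t and quantizer q."""
    qt = math.floor(t / q + 0.5)   # assumed nearest-integer quantizer
    residual = t - q * qt          # assumed: error against dequantized value
    return entropy_reduction(qt, q) * residual ** 2 / 2.0


small = cost(3.0, 16)    # quantized to 0, so no error is emitted
unit = cost(20.0, 16)    # quantized to 1, full weight
large = cost(50.0, 16)   # quantized to 3, weight reduced by the ramp
```

With these assumptions, a coefficient quantized to zero contributes nothing to training, while errors from large quantized values are scaled down, which matches points (a) and (c) above.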
Based on the neural network shown in Figure 1, we can train the convolution kernel. The
specific  training  algorithm  is  given  in  Section  2.2.  Unfortunately,  the  neural  network  suggested  kernel 
may  not  be  a  wavelet  kernel.  Section  2.4  shows  a  method  to  conduct  wavelet  decomposition  without 
using  the  high-pass  filter.  Hence,  the  low-pass  filter  is  the  only  kernel  to  process  the  4  channels  for  two- 
dimensional  (2-D)  wavelet  decomposition.  Section  2.5  provides  algorithms  that  will  modify  the  kernel 
to  fulfill  the  requirements  of  wavelet  kernel.  Through  this  process,  we  can  find  a  wavelet  that  produces 
the  lowest  quantization  errors  with  the  lowest  entropy  of  the  quantized  transform  coefficients. 


2.2.  Signal  Propagation  through  Convolution  Process  and  Methods  for  Training  the  Neural  Network 
The  signal  propagation  from  input  layer  to  output  layer  involving  convolution  computation  is 
given  below: 


$$T_c(i,j) = K_c(u,v) \otimes S(i,j) \qquad (3)$$


where S(i,j) is the original image, subscript c denotes the channel number, and Kc(u,v) is the convolution
kernel for channel c. For the wavelet decomposition, the relationship between Kc(u,v) and the wavelet
filters (i.e., h and g filters) will be given in Sections 2.3 and 2.4.

Since we treat the wavelet transform as a locally connected neural network, the well-known
backpropagation (BP) training method can be used to train the weights (kernel) in each epoch. Note
that a linear function, instead of the sigmoid function typical of a conventional neural network system, is
used in this process. The updated kernel suggested by backpropagation in the neural network is given by

$$K_c(u,v)[t+1] = K_c(u,v)[t] + \eta \sum_{i,j} \delta(i,j)\, S(i-u, j-v) + \alpha\, \Delta K_c(u,v)[t] \qquad (4)$$

where t is the iteration number during the training, α is the gain for the momentum term received in
the previous learning loop, η is the gain for the current weight changes, and δ is the weight-update
function, which is given by

$$\delta(i,j) = \frac{\partial E_f}{\partial K_c(u,v)} \qquad (5)$$

2.3.  Two-Dimensional  Wavelet  Decomposition 

Following Mallat's 2-D wavelet analysis, the two-dimensional scaling function is composed of
two one-dimensional scaling functions in both directions:

$$\Phi(x,y) = \phi(x)\,\phi(y) \qquad (6)$$

where φ(x) is a scaling function. The associated two-dimensional wavelets are defined as

$$\Psi^1(x,y) = \phi(x)\,\psi(y) \qquad (7)$$

$$\Psi^2(x,y) = \psi(x)\,\phi(y) \qquad (8)$$

$$\Psi^3(x,y) = \psi(x)\,\psi(y) \qquad (9)$$

where ψ(x) is the 1-D wavelet corresponding to the 1-D scaling function. Using the sub-band coding
algorithm, the wavelet transform (2-D DWT) of a matrix has four parts:


SPIE  Vol.  2707  /  203 


$$W_{LL}(f(x,y)) = \sum_{u,v} [(f(x,y)\,h(u-2x,0))\,h(0,v-2y)] = \sum_{u,v} [f(x,y)\,h_{LL}(u-2x,v-2y)] \qquad (10)$$

$$W_{LH}(f(x,y)) = \sum_{u,v} [(f(x,y)\,h(u-2x,0))\,g(0,v-2y)] = \sum_{u,v} [f(x,y)\,h_{LH}(u-2x,v-2y)] \qquad (11)$$

$$W_{HL}(f(x,y)) = \sum_{u,v} [(f(x,y)\,g(u-2x,0))\,h(0,v-2y)] = \sum_{u,v} [f(x,y)\,h_{HL}(u-2x,v-2y)] \qquad (12)$$

$$W_{HH}(f(x,y)) = \sum_{u,v} [(f(x,y)\,g(u-2x,0))\,g(0,v-2y)] = \sum_{u,v} [f(x,y)\,h_{HH}(u-2x,v-2y)] \qquad (13)$$

where h and g are the low- and high-pass filters of the sub-band decomposition, with the condition
g(u) = (-1)^u h(1 - u). The low-pass filter, h, also must satisfy three criteria to construct the orthonormal
basis of compactly supported wavelets (note that we also use gu and hu to replace g(u) and h(u),
respectively, for simplicity in this paper):


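(a) $$\sum_u h_{2u} - \frac{\sqrt{2}}{2} = \sum_u h_{2u+1} - \frac{\sqrt{2}}{2} = 0; \qquad (14)$$

(b) hu should be orthonormal; this means that

$$\sum_u h_u h_{u+2n} - \delta_{u,u+2n} = 0 \qquad (15)$$

where δij is the Kronecker delta function and n is an integer; and

(c) hu should have a high degree of regularity.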

From the compression perspective, the above constraints are very limiting. For lossless compression,
any filters performing perfect reconstruction are eligible. However, we would like to focus our view
on using the wavelet transform in this paper.
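
Criteria (a) and (b) and the relation between g and h can be checked numerically. The sketch below uses the well-known tap-4 Daubechies low-pass coefficients; the function names are hypothetical, and the high-pass filter is written with the common index shift g(u) = (-1)^u h(k-1-u), which differs from g(u) = (-1)^u h(1-u) only by an even shift that keeps the indices inside the array.

```python
import math

SQRT3 = math.sqrt(3.0)
NORM = 4.0 * math.sqrt(2.0)
# Tap-4 Daubechies low-pass filter (h0 is about 0.48296291)
D4 = [(1 + SQRT3) / NORM, (3 + SQRT3) / NORM,
      (3 - SQRT3) / NORM, (1 - SQRT3) / NORM]


def high_pass_from_low_pass(h):
    """g(u) = (-1)^u h(k-1-u): the g(u) = (-1)^u h(1-u) rule, shifted by k-2."""
    k = len(h)
    return [(-1) ** u * h[k - 1 - u] for u in range(k)]


def constraint_residuals(h):
    """Residuals of Eq. (14) (even/odd sums) and Eq. (15) (orthonormality)."""
    even = sum(h[0::2]) - math.sqrt(2.0) / 2.0
    odd = sum(h[1::2]) - math.sqrt(2.0) / 2.0
    k = len(h)
    ortho = []
    for n in range(k // 2):
        dot = sum(h[u] * h[u + 2 * n] for u in range(k - 2 * n))
        ortho.append(dot - (1.0 if n == 0 else 0.0))
    return even, odd, ortho


even, odd, ortho = constraint_residuals(D4)
g = high_pass_from_low_pass(D4)
```

All residuals vanish to rounding error for this filter, and `g` is orthogonal to `D4`, as expected for an orthonormal wavelet filter pair.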

The 2-D filters in the second forms of Eqs. (10)-(13) are the vector products of h and/or g filters.
The relationship between the high-pass and low-pass filters makes the unification of the four sets of
decomposition possible, as shown in Section 2.4.

According to wavelet theory, given a set of h, one can calculate the Fourier
transforms of the scaling and wavelet functions as follows:

$$\Phi(\omega) = H_0(e^{i\omega/2})\,\Phi(\omega/2) \qquad (16)$$

$$\Psi(\omega) = H_1(e^{i\omega/2})\,\Phi(\omega/2) \qquad (17)$$

where H0 and H1 are the Fourier transforms of the h and g filters, respectively. Hence, both the scaling and
wavelet functions can be obtained through infinite recursion by using Eqs. (16) and (17), respectively.


2.4.  Unification  of  the  Four  Channels  Decomposition  in  2-D  DWT 

Using Eq. (11) as an example, we rewrite the decomposition equation by replacing the g filter with the h
filter:

$$W_{LH}(f(x,y)) = \sum_{u,v} [(f(x,y)\,h(u-2x,0))\,(-1)^v\,h(0,2y+1-v)] \qquad (18)$$

or

$$W_{LH}(f(x,y)) = \sum_{u,v} [(((-1)^y f(x,-y))\,h(u-2x,0))\,h(0,v-2y)] = \sum_{u,v} [((-1)^y f(x,-y))\,h_{LL}(u-2x,v-2y)] = \sum_{u,v} [f_{LH}(x,y)\,h_{LL}(u-2x,v-2y)]. \qquad (19)$$

Converting Eq. (12) to use the 2-D low-pass filter as the kernel is a matter of changing the
orientation from the y- to the x-direction (or combining both directions for Eq. (13)). These conversions also
indicate that one can use a single 2-D filter to compute the four quadrants of the 2-D wavelet transform
by flipping the matrix position in the x- and/or y-direction(s) and alternating the sign of the flipped matrix
corresponding to the direction(s).

The alternating sign of the source vector makes the convolution operation unconventional. A
precalculation method that involves a cross product of two vectors can be employed: the flipped data
sequence of an image is the first vector, and the second vector is fixed and composed of +1 and -1. An
example of the 1-D precalculation steps for a tap-6 kernel prior to the convolution operation is given below:

Original data sequence: s1, s2, s3, s4, s5, s6
Flipped data sequence: s6, s5, s4, s3, s2, s1
Resultant data sequence: the flipped sequence multiplied element by element by the alternating +1/-1 vector

In the case of 2-D, three matrices associated with horizontal, vertical, and diagonal decomposition for
the second matrix in the precalculation are given below in Figure 2. With this precalculation (or cross
product of two matrices), only the low-pass filter huhv (hu in 1-D) is needed for the final wavelet
transform operation.


[Figure 2 panels: three sign matrices with entries +1 and -1, labeled Vertical operator, Horizontal operator, and Diagonal operator.]

Figure 2. Three matrices used for the cross product precalculation.
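
The unification idea can be verified in one dimension with a short sketch. It assumes zero extension outside the data, an even data length, and the relation g(m) = (-1)^m h(1-m); under this particular indexing the fixed sign vector starts with -1 and the output positions come out in reversed order, which is a consequence of the conventions chosen here rather than something stated in the paper. All function names are hypothetical.

```python
import math


def at(seq, i):
    """Zero-extended lookup: seq[i] inside the array, 0 outside."""
    return seq[i] if 0 <= i < len(seq) else 0.0


def sign(m):
    """(-1)^m for any integer m."""
    return 1.0 if m % 2 == 0 else -1.0


def high_pass_direct(f, h, y):
    """sum_v f(v) g(v - 2y), with g(m) = (-1)^m h(1 - m)."""
    return sum(f[v] * sign(v - 2 * y) * at(h, 1 - (v - 2 * y))
               for v in range(len(f)))


def low_pass(seq, h, y):
    """sum_j seq(j) h(j - 2y): the ordinary low-pass channel."""
    return sum(seq[j] * at(h, j - 2 * y) for j in range(len(seq)))


def precalculate(f):
    """Flip the data, then multiply by a fixed vector of alternating -1/+1."""
    n = len(f)
    return [-sign(j) * f[n - 1 - j] for j in range(n)]


def high_pass_via_low_pass(f, h, y):
    """High-pass output obtained with the low-pass filter only (even len(f))."""
    return low_pass(precalculate(f), h, len(f) // 2 - 1 - y)


r3, norm = math.sqrt(3.0), 4.0 * math.sqrt(2.0)
h = [(1 + r3) / norm, (3 + r3) / norm, (3 - r3) / norm, (1 - r3) / norm]
f = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
pairs = [(high_pass_direct(f, h, y), high_pass_via_low_pass(f, h, y))
         for y in range(-1, 5)]
```

Each pair in `pairs` agrees to rounding error, so only the low-pass kernel needs to be stored and trained once the flipped, sign-alternated data have been precalculated.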


Nevertheless, the resultant matrix of this precalculation (or cross product of two matrices) must
be held in computer memory to facilitate the computation of the forward convolution and the
corresponding backpropagation. After precalculation, the size of the intermediate images is (k/2 x k/2)
times the original image size. The factor of 1/2 x 1/2 is due to the two-dimensional down-sampling by 1/2
in a conventional forward wavelet transform. The largest three blocks shown in Figure 3 are the
intermediate images S0(xk/2, yk/2).

One of the original criteria, the so-called "high degree of regularity," was not enforced
in the algorithm. The orthonormality of the hu filter may not be self-sustained with each updated
version. However, a small modification can make the final version of hu orthonormal, if
the conditions of being a wavelet filter set are to be fully met. Based on each precalculated image
S0(xk/2, yk/2) described earlier, Eq. (4) can be rewritten for updating the 2-D wavelet kernel:

$$K(u,v)[t+1] = K(u,v)[t] + \eta \sum_{i,j} \delta(i,j)\, S_0(xk/2 - u, yk/2 - v) + \alpha\, \Delta K(u,v)[t] \qquad (20)$$

where index i = 0, 1, ..., (k-1)^2 corresponds to the sub-image of S0 matched to the kernel size. Eq. (20)
represents the updated kernel suggested by the BP; these values require a conversion to a new wavelet
kernel h'uh'v. Assuming the wavelet filter is a 2-D vector (i.e., huhv = hvhu = hu,v, where u and v = 0, 1, 2, ...,
k-1), only k free parameters need to be trained for a wavelet transform. A solution that satisfies the
wavelet constraints and makes h'uh'v approximately equal to K'(u,v) is given in Section 2.5.


[Figure 3 diagram legend: huhv, filter kernel; h'uh'v, updated filter kernel; convolution operation; Q, quantization; Q^-1, reverse quantization; E( ), entropy calculation; precalculation for horizontal, vertical, and diagonal convolution operations; down sampling by a factor indicated; error back-propagation training through inverse convolution. Matrix C is the result of subtracting matrix A from matrix B.]

Figure 3. A proposed training scheme based on a grouped (kernel) backpropagation neural network to
obtain an optimal orthonormal kernel for image compression.


2.5.  Converting  Neural  Network  Suggested  Kernel  to  Fulfill  Requirements  of  a  Wavelet  Filter 

As indicated in Eq. (20), the updated weights, K(u,v)[t+1] or K'(u,v), of the kernel suggested by
the BP at training iteration t+1 are independent. One must realize that each epoch in the neural network
training is only a suggestion or approximation that the changes of weights may produce a lower value
for the defined error function, Ef. To properly use this suggestion for making a new wavelet kernel, let us
assume that there exists a set of h'u such that the updated 2-D version of the wavelet filter is very close to
K'(u,v). A function based on the square difference is used in the derivation:

$$f(h'_u) = \sum_{u,v} [K'(u,v) - h'_u h'_v]^2 \qquad (21)$$

Here we intend to minimize the function f subject to the constraint equations. The Lagrangian multiplier
method can be employed to solve this problem by combining f and the constraint equations:

$$df(h'_u) + \sum_p \lambda_p\, dC_p(h'_u) = 0 \qquad (22)$$

where d represents the differentiation operation of a function and λp is the Lagrangian multiplier for
the corresponding constraint equation, Cp(h'u) = 0, referring to Eqs. (14) and (15). Using this approach
we can obtain a set of h'u while f is also minimized.
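
The conversion step can be illustrated for tap-4 filters with a simpler substitute for the Lagrangian solution. The sketch below does not solve Eq. (22); instead it assumes a one-parameter angle family of tap-4 filters, every member of which satisfies Eqs. (14) and (15), and picks by grid search the member whose outer product is closest to the suggested kernel in the squared-difference sense. The names `tap4_filter`, `squared_difference`, and `nearest_wavelet_filter` are hypothetical.

```python
import math


def tap4_filter(theta):
    """Tap-4 low-pass filter from an angle; satisfies Eqs. (14) and (15)."""
    c, s = math.cos(theta), math.sin(theta)
    r = 2.0 * math.sqrt(2.0)
    return [(1 - c + s) / r, (1 + c + s) / r, (1 + c - s) / r, (1 - c - s) / r]


def squared_difference(kernel2d, h):
    """Sum over u, v of [K'(u,v) - h_u h_v]^2."""
    return sum((kernel2d[u][v] - h[u] * h[v]) ** 2
               for u in range(4) for v in range(4))


def nearest_wavelet_filter(kernel2d, steps=3600):
    """Grid search over the angle (a stand-in for the Lagrangian method)."""
    angles = (2.0 * math.pi * i / steps for i in range(steps))
    best = min(angles, key=lambda th: squared_difference(kernel2d, tap4_filter(th)))
    return tap4_filter(best)


d4 = tap4_filter(math.pi / 3.0)   # reproduces the Daubechies tap-4 filter
# A made-up "suggested" kernel: the D4 outer product plus a small disturbance
suggested = [[d4[u] * d4[v] + 0.001 * (-1) ** (u + v) for v in range(4)]
             for u in range(4)]
h_new = nearest_wavelet_filter(suggested)
```

The returned `h_new` is a valid orthonormal low-pass filter close to the one hidden in the disturbed kernel. For longer filters the family has more free parameters, and a constrained solver such as the Lagrangian approach above would be needed.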


3.  Materials  and  Experimental  Methods 

A database consisting of 45 mammograms was used to conduct the study. Of these
mammograms, 38 contain biopsy-proven clustered microcalcifications. A total of 220
microcalcifications are embedded in 41 clusters. All 45 mammograms were digitized by a LumyScan
(model 150) film digitizer with a spot size of 100 μm. Each patch of 32 x 32 pixels (i.e., an area of
3.2 x 3.2 mm^2) with its center at the peak value was isolated for the study of the quantization impact on
microcalcifications. The search for optimal wavelet kernels was conducted for both the original mammograms and the
microcalcification patches. Each image was decomposed by a 3-level wavelet transform.
Quantization values were q, q/2, and q/4 for the high-frequency coefficients on decomposition levels 1, 2,
and 3, respectively. For each training epoch, the mean-square-error (MSE) and %zeros (i.e., number of
zeros / total number of pixels) were computed. Since %zeros generally contributes the most important
factor to the compression gain, it can be used as a coarse index for the evaluation of compression
efficiency for each epoch.
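
The two per-epoch indices can be sketched as follows. The sketch assumes nearest-integer quantization, measures the MSE in the coefficient domain against the dequantized values, and uses made-up coefficient values; the name `evaluate` is hypothetical.

```python
import math


def evaluate(levels, q):
    """Return (MSE, %zeros) for high-frequency coefficients of levels 1..3.

    levels[0] holds level-1 coefficients (step q), levels[1] level 2
    (step q/2), and levels[2] level 3 (step q/4).
    """
    total_sq, zeros, count = 0.0, 0, 0
    for depth, coeffs in enumerate(levels):
        step = q / (2 ** depth)
        for c in coeffs:
            n = math.floor(c / step + 0.5)    # assumed rounding quantizer
            total_sq += (c - step * n) ** 2   # error after dequantization
            zeros += 1 if n == 0 else 0
            count += 1
    return total_sq / count, 100.0 * zeros / count


levels = [[3.0, -40.0, 7.9, 100.0], [5.0, -9.0], [1.0]]
mse, pct_zeros = evaluate(levels, 16)
```

A filter that raises `pct_zeros` at a comparable `mse` is the kind of kernel the search favors.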

In order to demonstrate the performance of each wavelet, we used the sorted first coefficient h0 of the
low-pass filter associated with the mother scale function as the horizontal scale, because the training epoch
does not represent the wavelet being used, as shown in Figures 6 and 7. All h0 values are greater than
-0.1464466094 and smaller than 0.8535533905. The corresponding h1 values are greater than
0.35355339 and smaller than 0.8535533905. Those h1 values that are greater than -0.1464466094
and smaller than 0.35355339 have corresponding conjugate values in the former set and can be ignored.

Compression ratios were calculated only when the neural network search had been successful.
The spatial and temporal correlation of quantized coefficients was taken into account but might not be
optimized. Specifically, we arranged quantized coefficients from one pixel of the highest level to the
corresponding 4 pixels on the second highest level to 16 pixels on the lowest level and then went back to
the next pixel of the highest level and so on. This rearranged data sequence is more correlated in a
spatial-temporal sense and can be encoded effectively by Lempel-Ziv coding.
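
The coefficient ordering can be sketched for one sub-band orientation as follows. The function name `tree_order` and the toy arrays are hypothetical, and the standard-library `zlib` module (DEFLATE, which builds on LZ77) is used only as a convenient stand-in for the Lempel-Ziv coder; the coder actually used in the study is not specified here.

```python
import struct
import zlib


def tree_order(level3, level2, level1):
    """Order coefficients as 1 coarse pixel, its 4 children, its 16 grandchildren."""
    seq = []
    n = len(level3)
    for r in range(n):
        for c in range(n):
            seq.append(level3[r][c])
            seq.extend(level2[2 * r + dr][2 * c + dc]
                       for dr in range(2) for dc in range(2))
            seq.extend(level1[4 * r + dr][4 * c + dc]
                       for dr in range(4) for dc in range(4))
    return seq


# Toy quantized coefficients: a 1x1 level-3 band with matching 2x2 and 4x4 bands
level3 = [[1]]
level2 = [[2, 3], [4, 5]]
level1 = [[6, 7, 8, 9], [10, 11, 12, 13], [14, 15, 16, 17], [18, 19, 20, 21]]

seq = tree_order(level3, level2, level1)
packed = struct.pack(f"{len(seq)}b", *seq)   # assumes values fit in signed bytes
compressed = zlib.compress(packed)
```

Grouping each coarse coefficient with its descendants places runs of zeros from the same image location next to each other, which is what a dictionary coder exploits.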

We have also performed the same study for the 220 isolated microcalcification patches. The 2-D
profiles of the microcalcifications and their nearby areas (i.e., the areas that are not included in the
microcalcification profile but lie within the isolated block of 32 x 32 pixels) were evaluated separately during
the course of the neural network search. In addition, features of the microcalcifications were computed
to observe their changes. These features of a microcalcification are:

(a)  the  peak  value,  P; 

(b)  the  contrast,  C  =  P-b; 

where  b  is  the  average  background  value  which  is  the  immediate  boundary  of  the 
microcalcification  profile; 
(c) the signal-to-noise-ratio, SNR = C/SDb;

where  SDb  stands  for  the  standard  deviation  of  the  background;  and 

(d)  the  area  occupied  by  the  2-D  microcalcification  profile,  A. 
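
The four features can be computed from a patch and a binary profile mask as in the sketch below. It assumes that the "immediate boundary" is the ring of pixels outside the profile that touch it through a 4-neighbor, that SDb is the population standard deviation of that ring, and that the area A is a pixel count; the function names and the toy patch are hypothetical.

```python
import statistics


def boundary_ring(mask):
    """Pixels outside the profile that have a 4-neighbor inside it."""
    rows, cols = len(mask), len(mask[0])
    ring = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c]:
                continue
            neighbors = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            if any(0 <= rr < rows and 0 <= cc < cols and mask[rr][cc]
                   for rr, cc in neighbors):
                ring.append((r, c))
    return ring


def microcalcification_features(patch, mask):
    """Return peak P, contrast C, SNR, and area A for one profile."""
    inside = [patch[r][c] for r in range(len(patch))
              for c in range(len(patch[0])) if mask[r][c]]
    ring = [patch[r][c] for r, c in boundary_ring(mask)]
    peak = max(inside)
    background = statistics.mean(ring)        # b
    contrast = peak - background              # C = P - b
    snr = contrast / statistics.pstdev(ring)  # SNR = C / SDb
    return {"P": peak, "C": contrast, "SNR": snr, "A": len(inside)}


patch = [[11] * 5 for _ in range(5)]
patch[2][2] = 50
patch[1][2], patch[3][2], patch[2][1], patch[2][3] = 10, 12, 10, 12
mask = [[r == 2 and c == 2 for c in range(5)] for r in range(5)]
features = microcalcification_features(patch, mask)
```

Running the same function on a patch before and after compression gives the feature changes that the study tracks.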


4.  Results 

In the neural network training, the MSE is not the only factor of concern; the entropy
reduction function is another factor that drives the neural network to perform a search. In the first neural
network experiment, we found that the MSE changed very little with a low quantization factor (q=16).
The neural network movement in searching for the next wavelet kernel was random, and no minimum of
MSE could be found in the mammogram study. However, the %zeros changed, which led the neural
network to converge at the maximum value of %zeros. When a larger quantization factor (q=64) was
used, the MSE seemed to function in training the neural network. Figures 4 and 5 show the curves of
MSEs and %zeros against the sorted h0 values. In both figures, Daubechies' wavelet (h0 = 0.48296291) and its
nearby wavelets achieve the highest %zeros, implying the largest compression ratio.
Figure 4. Decomposition Performance of Wavelets on Mammograms (q=16).

Figure 5. Decomposition Performance of Wavelets on Mammograms (q=64).


In the microcalcification study, we found that %zeros does not change much until h0 > 0.6.
Figure 6 shows the original learning steps, which drive the MSEs to lower values using the proposed
neural network training mechanism. Figure 7, which is a sorted version of Figure 6 (the same sorting
processes were applied to all figures in this section), shows that Daubechies' wavelets achieve the
lowest MSEs. More specifically, microcalcification profiles suffered higher MSEs than their
background areas, as indicated in Figure 8.

These results changed when a very large quantization factor was used. In Figure 9, all the
microcalcification patches were rounded off to 8 bits prior to the study, on the assumption that
digitized mammograms contain about 4 bits of noise. Although the largest quantization factor was 16
for 8-bit mammograms, the effective quantization factor was equivalent to about 256 in 12-bit
mammograms. Figure 9 shows that Haar's wavelet (h0 = 0.0) yields a high MSE for the 2-D
microcalcification profiles and the lowest MSE for their background. Daubechies' wavelets, however,
behave in the opposite way. This is probably because Haar's wavelet can produce a lower entropy for
low-noise smooth areas.


Figure 6. MSEs were Decreased During the Training of the Neural Network on 220
Microcalcifications (q=64).

Figure 7. Decomposition Performance of Wavelets on 220 Microcalcifications (q=64).
Horizontal axis: sorted h0 values of various wavelet scale functions.

Figure 8. Decomposition Performance of Wavelets on 220 Microcalcification Profiles and
Background (q=64). Horizontal axis: sorted h0 values of various wavelet scale functions.

Figure 9. Decomposition Performance of Wavelets on 220 Microcalcification Profiles and
Background (8-bit, q=16). Horizontal axis: sorted h0 values of various wavelet scale functions.


The results of the microcalcification evaluation study based on quantized wavelet coefficients
are shown in Figures 10-13. The evaluation was performed under the same experimental conditions
as in Figure 9, except that microcalcification features were measured instead of MSEs and %zeros.
Note that the % number of decreases in peak value, contrast, and SNR is plotted as negative
values. In other words, the lower the % number decrease value, the more microcalcifications
underwent negative changes. The figure of merit (FOM) for each measure was a composite value
given by

FOM = (%No. decrease x % decrease + %No. increase x % increase) x 100.    ...(23)
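
One possible reading of Eq. (23) is sketched below. The report does not spell out the sign convention or the normalization, so the following are assumptions: "%No. decrease" and "%No. increase" are taken as the fractions of microcalcifications whose feature value fell or rose, "% decrease" and "% increase" as the mean signed relative changes within each group, and original feature values are assumed to be positive. The function name is hypothetical.

```python
def figure_of_merit(before, after):
    """Composite FOM in the spirit of Eq. (23), under the assumptions stated above.

    before, after: feature values (e.g. peak, contrast, or SNR) of the same
    microcalcifications before and after quantization; 'before' must be > 0.
    """
    changes = [(a - b) / b for b, a in zip(before, after)]   # signed relative change
    decreases = [c for c in changes if c < 0]
    increases = [c for c in changes if c > 0]
    n = len(changes)
    frac_dec = len(decreases) / n                            # "%No. decrease" as a fraction
    frac_inc = len(increases) / n                            # "%No. increase" as a fraction
    mean_dec = sum(decreases) / len(decreases) if decreases else 0.0   # negative or zero
    mean_inc = sum(increases) / len(increases) if increases else 0.0   # positive or zero
    return (frac_dec * mean_dec + frac_inc * mean_inc) * 100


print(figure_of_merit([100, 100, 100, 100], [90, 110, 120, 100]))
```

Under these assumptions the FOM reduces to the mean relative change over all microcalcifications, expressed in percent, so a higher value indicates better preservation of the feature.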

Figure 10. Peak Value Changes Due to Quantization Effects on Wavelet Domain for
Microcalcifications.

Figure 11. Contrast Changes Due to Quantization Effects on Wavelet Domain for
Microcalcifications.

Figure 12. SNR Changes Due to Quantization Effects on Wavelet Domain for Microcalcifications.

Figure 13. Areas of Microcalcification Profile Changes Due to Quantization Effects on Wavelet
Domain.


As indicated in Figure 10, the peak values changed very little. However, the % number of
increases in peak values, contrast values, and SNRs of microcalcifications had about the same
distribution in Figures 10, 11, and 12. The highest FOMs in all three measures occurred at the
wavelet with the low-pass filter coefficients (0.32252136, 0.85258927, 0.38458542, -0.14548269),
which is marked with an arrow in the figures. Figure 13 shows minor %area changes of
microcalcification profiles for h0 values from 0.2 to 0.6. These effects were not observed when a
low quantization factor was used.


5.  Discussion

These results call for some explanation, which we provide in the following discussion. We start
with plots of the low-pass h and the high-pass g filters for several of the wavelets of interest
mentioned in the Results section.

Figures 14 and 15 show the h and g filters, respectively. Note that the X wavelet is the wavelet
marked on the horizontal axis in Figures 10, 11, and 12. The g filters deserve more attention,
since they produce the high-frequency coefficients that are subject to quantization. The g filter
can be regarded as computing the center pixel value multiplied by the positive weight, plus the
adjacent pixel values on the two sides multiplied by the negative weights of the g filter.
Daubechies' wavelet has fairly balanced negative terms on the two sides of the positive weight, and
the sum of the negative weights is equal in magnitude and opposite in sign to the positive weight.
The latter is a constraint on all wavelet filters in any case. In addition, the absolute value of
g1 (= -h2) or g2 (= h1) should be reasonably large, which maintains the low-pass and high-pass
characteristics of the h and g filters, respectively. In fact, the wavelets near Daubechies'
wavelet, including the one (the X wavelet) with the highest performance in microcalcification
features, possess this property. From the signal processing point of view, these balanced weights
in a filter are very important for producing low entropy values for general textures. We suspect
that this property may be related to the so-called "high regularity" in wavelet theory.

In short, we found that the main reason a wavelet filter can produce a low entropy for a set of
data is that the weights of the g filter sum to zero. For a general data sequence, the g filter can
perform even better when


(a) the absolute value of g1 (= -h2) or g2 (= h1) is much larger than that of the other weights;

(b) the opposite-signed weights are evenly distributed on the two sides of g1 or g2.
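
The zero-sum property can be checked numerically. The sketch below builds g from h with the alternating-flip relation g[k] = (-1)^k h[N-1-k], which reproduces the relations g1 = -h2 and g2 = h1 quoted above. The X wavelet coefficients are those given in the text; for Daubechies' wavelet only h0 appears in the text, and the remaining three taps are the standard 4-tap Daubechies values, supplied here as an assumption.

```python
def highpass_from_lowpass(h):
    """Alternating flip: g[k] = (-1)**k * h[N-1-k].

    For four taps this gives g0 = h3, g1 = -h2, g2 = h1, g3 = -h0.
    """
    n = len(h)
    return [((-1) ** k) * h[n - 1 - k] for k in range(n)]


FILTERS = {
    # h0 from the text; h1..h3 are the standard 4-tap Daubechies values (assumed)
    "Daubechies": (0.48296291, 0.83651630, 0.22414387, -0.12940952),
    # low-pass coefficients of the X wavelet as listed in the text
    "X wavelet": (0.32252136, 0.85258927, 0.38458542, -0.14548269),
}

for name, h in FILTERS.items():
    g = highpass_from_lowpass(h)
    # sum(g) should vanish up to the rounding of the published coefficients
    print(name, "g =", [round(w, 8) for w in g], "sum(g) =", round(sum(g), 6))
```

In both filters the single positive weight g2 = h1 dominates, with negative weights on either side, which is the balance described in conditions (a) and (b).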


Figure  14.  Low-pass  Filters  of  Several  Interesting  Wavelets. 


Figure  15.  High-pass  Filters  of  the  Same  Wavelets. 


For low-noise smooth signals, Haar's wavelet may slightly outperform the others. For sharp edges,
Haar's wavelet would greatly outperform the others, as depicted in Figure 16, where the evaluation
was restricted to bones and to the edges between bones and soft tissue isolated on computed
tomographic (CT) images.


Figure  16.  Decomposition  Performance  of  Wavelets  on  CT  Head  Bones  and  Bone  Edges  (q=64). 


We still do not fully understand why the wavelet with the low-pass filter (0.32252136,
0.85258927, 0.38458542, -0.14548269) resulted in the highest feature preservation. However, Figure 9
provides a clue: the MSEs of the 2-D microcalcification profiles and of the background gradually
merge in going from Haar's to Daubechies' wavelets. Since contrast and SNR values are computed using
the peak and background values of the microcalcifications, the optimum of these measures should
occur somewhere between Haar's and Daubechies' wavelets.

In the field of compression, it is known that the higher the compression ratio, the higher the
error generated in the decompressed image. Through these studies, however, we observed a different
relationship between these two main quantitative measures in compression. We found that higher
compression coincided with less error in all the studies (see Figures 4, 5, 7, and 16) when a fixed
quantizer was used. This may be because high compression is associated with low entropy, which
means that the data contain more low values and less variation between the originally transformed
and the quantized coefficients. This phenomenon occurs only when the quantization factor is fixed.
We call the reader's attention to the link between this phenomenon and the designed error function,
which comprises MSE and entropy reduction terms for training the convolution neural network. With
this concurrent trend (i.e., less error is associated with low entropy using a fixed quantizer),
the neural network can be trained effectively. Otherwise, the two terms would have acted as
competing factors and would have made the training of the neural network difficult.
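
A toy calculation illustrates how %zeros and MSE can move together under a fixed quantizer. The two coefficient sets below are invented for illustration and are not data from this study; the uniform round-to-nearest quantizer and all function names are likewise assumptions.

```python
import math


def quantize(coeffs, q):
    """Uniform quantization with a fixed factor q, followed by dequantization."""
    return [q * math.floor(c / q + 0.5) for c in coeffs]


def percent_zeros(coeffs, q):
    """Percentage of coefficients that quantize to zero."""
    quantized = quantize(coeffs, q)
    return 100.0 * sum(1 for v in quantized if v == 0) / len(quantized)


def mse(coeffs, q):
    """Mean squared error between original and dequantized coefficients."""
    quantized = quantize(coeffs, q)
    return sum((c - v) ** 2 for c, v in zip(coeffs, quantized)) / len(coeffs)


Q = 16
compact = [-4, -2, -1, 0, 1, 2, 4, 3]         # hypothetical low-entropy coefficients
spread = [-40, -25, -10, 0, 10, 25, 40, 7]    # hypothetical higher-entropy coefficients

for name, coeffs in (("compact", compact), ("spread", spread)):
    print(name, "%zeros =", percent_zeros(coeffs, Q), "MSE =", mse(coeffs, Q))
```

When most coefficients fall well inside the zero bin, the quantization error of each is simply its own small magnitude, so the set with more zeros also has the lower MSE. With a different q for each set this coupling need not hold.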

Although we have shown the general framework of a wavelet filter search using a neural network
training method, only 4-tap wavelets were employed in our experiment. The above findings appear to
generalize to higher-order wavelets, because the g filter is the key operator in the wavelet
decomposition. The distribution of weights for higher-order wavelets should be maintained as
discussed above in order to obtain a low entropy. We will continue to investigate the performance
of bi-orthonormal wavelets, where an odd number of weights is used. We predict that bi-orthonormal
wavelets with high performance in compression and data accuracy should possess a balanced
distribution of weights in the g filter.

In our previous papers, we indicated that wavelet (both orthonormal and bi-orthonormal)
decomposition might be appropriate for low-resolution small images such as the Lena image, CTs, and
MRIs. For high-resolution large images such as digitized chest radiographs and mammograms, we found
that the full-frame DCT performed with the highest compression efficiency. This is because the DCT
can pack highly correlated image information into a small frequency area. The DWT, however,
requires many levels of decomposition to achieve a high compression ratio. The data inaccuracy
would propagate from the high-level wavelet domains to the low-level ones and to the reconstructed
image.


6.  Conclusions 

A neural network based method has been developed to search for optimal wavelet kernels that
produce the most favorable set of transform coefficients for preserving data accuracy and/or
defined image features during compression. In this paper, our technical achievements are: (a)
development of a unified method to facilitate multichannel wavelet decomposition; (b) design of a
cost (error) function consisting of the MSE and an imposed entropy reduction function for training
the convolution neural network; and (c) conversion of the kernel suggested by the neural network
into a filter constrained by the wavelet requirements.

In all medical image modalities we have tested so far (including mammography, CT, and MRI),
Daubechies' wavelet (or its nearby wavelets) generally performs better (in most cases slightly
better) than other wavelets for image compression using a global measure. With a large quantization
factor, Haar's wavelet produces the lowest and highest MSEs for the background and
microcalcification profile areas, respectively, whereas Daubechies' wavelet produces the opposite
result. In addition, we found that the wavelet associated with the low-pass filter (0.32252136,
0.85258927, 0.38458542, -0.14548269) possesses the highest feature preservation capability in
microcalcification peak, contrast, and SNR. Through this study, we also found that only Haar's
wavelet sometimes produced a dramatic result; usually the optimum occurs over a band of wavelets
rather than at a single wavelet.

We therefore conclude that Daubechies' wavelet (and its nearby wavelets) is generally
applicable for image compression, whereas Haar's wavelet is suitable for low-noise smooth areas and
sharp edges. For a specific image pattern such as microcalcifications on mammograms, one may find a
wavelet filter that best preserves the features.

By reviewing the g filters of various wavelets, we found that the optimal wavelets for general
image texture have something in common: they possess balanced negative terms on the two sides of
the positive weight, and the absolute value of g1 or g2 is much larger than that of the other
weights.


Image feature selection by a genetic algorithm: Application to classification
of mass and normal breast tissue

We investigated a new approach to feature selection, and demonstrated its application in the task
of differentiating regions of interest (ROIs) on mammograms as either mass or normal tissue. The
classifier included a genetic algorithm (GA) for image feature selection, and a linear discriminant
classifier or a backpropagation neural network (BPN) for formulation of the classifier outputs. The
GA-based feature selection was guided by higher probabilities of survival for fitter combinations
of features, where the fitness measure was the area under the receiver operating characteristic
(ROC) curve. We studied the effect of different GA parameters on classification accuracy, and
compared the results to those obtained with stepwise feature selection. The data set used in this
study consisted of 168 ROIs containing biopsy-proven masses and 504 ROIs containing normal tissue.
From each ROI, a total of 587 features were extracted, of which 572 were texture features and 15
were morphological features. The GA was trained and tested with several different partitionings of
the ROIs into training and testing sets. With the best combination of the GA parameters, the
average test Az value using a linear discriminant classifier reached 0.90, as compared to 0.89 for
stepwise feature selection. Test Az values with a BPN classifier and a more limited feature pool
were 0.90 with GA-based feature selection, and 0.89 for stepwise feature selection. The use of a GA
in tailoring classifiers with specific design characteristics was also discussed. This study
indicates that a GA can provide versatility in the design of linear or nonlinear classifiers
without a trade-off in the effectiveness of the selected features. © 1996 American Association of
Physicists in Medicine.

Key  words:  mammography,  computer-aided  diagnosis,  genetic  algorithms,  feature  selection 


I.  INTRODUCTION 

Computer-aided diagnosis (CAD) for detection and classification of breast abnormalities on
mammograms is an active area of research. Clinical studies have shown that 10% to 30% of breast
cancers visible on mammograms in retrospective studies were initially missed by radiologists, and
that only 15% to 30% of the patients who have undergone biopsy due to a suspicious finding on
mammograms are found to have breast cancer. CAD methods have the potential of reducing the
false-negative rate while improving the positive predictive values of the mammographic
abnormalities.

Masses are important indicators of malignancy on mammograms. In recent years, considerable effort
has been devoted to the development of computerized methods for detection and classification of
masses. In all of these investigations, the detection or classification task relies on the use of
features extracted from the digitized mammograms. The extracted features represent properties of
pixels (or groups of pixels) which contain characteristic information of the masses. In this paper,
we report our development of a computerized method for classification of regions of interest (ROIs)
on mammograms as either masses or normal tissue, with particular emphasis on a genetic algorithm
for feature selection.

Feature selection is a very important step in classification, because the inclusion of
inappropriate features often adversely affects classifier performance, especially when the training
set is not sufficiently large. The methods employed for feature selection vary. In some approaches,
very few features were used, and the process of feature selection was not clearly described. It is
reasonable to assume that the features were selected on the basis of some prior knowledge from
clinical experience. Wu et al. selected 14 features from a total of 43 for classification of
malignant and benign masses, and observed an improvement in classification accuracy when the
reduced feature space was used instead of the entire feature space. The criterion for selection was
the difference of the average values of individual features between the two classes. Goldberg et
al. first selected five features from a total of 26 based on the ability of the individual features
to discriminate between malignant and benign masses. Subsequently, based on their pairwise
discriminatory ability, three final features were selected from the remaining five features. In the
study by Chitre et al., the criterion for texture feature selection was the combination of a
classification error and a clustering technique using individual features independently. In our
previous studies, we employed a stepwise feature selection procedure in linear discriminant
analysis (LDA), in which a feature is included or excluded at each step based on a chosen
statistical criterion. The LDA takes into account the correlation between the features and the
joint probability distribution of the feature vectors in the multidimensional feature space.

Many feature selection methods have been explored in CAD. However, the best method which can
provide the highest accuracy for a given application is still in question. This is partly because
feature selection is theoretically a difficult problem. It is well known, for example, that the two
independent features that yield the highest classification accuracy in a feature set may not
constitute the best pair of features together. In the training process in CAD, the classifier can
be designed so that the probability of training error will not increase when the number of selected
features increases. However, when both training and testing are desired, the problem becomes more
complicated due to overfitting. Test results can deteriorate when the number of selected features
increases, especially when the number of training cases is small. It is imperative to select a
smaller subset of features to overcome the so-called "curse of dimensionality" (decrease in
classification accuracy of the test set with an increasing number of features) if the ratio of the
number of training cases to the number of available features is not sufficiently large. Several
recipes for feature selection are mentioned in the literature, but none of these, except for an
exhaustive search procedure, is optimal.

Genetic algorithms (GAs), first introduced by Holland in the early seventies, are becoming
increasingly popular in solving optimization and machine learning problems. The fundamental
principle underlying GAs is based on natural selection. To solve an optimization task, a GA
maintains a population of bit strings, which are referred to as chromosomes. Each chromosome
corresponds to a possible solution of the problem. In each generation of the GA, the population is
probabilistically modified, generating new chromosomes which may have a better chance of solving
the optimization problem. GAs have been applied to complex optimization problems such as the
control of a gas-pipeline system, design of jet engine turbines, training of a backpropagation
neural network, feature selection for an artificial neural network, and automated detection of lung
nodules. GAs usually yield nonoptimal, but near-optimal solutions. They are thus well-suited for
feature selection problems in large feature spaces, where the optimal solution is practically
impossible to compute, and a near-optimal solution is the best alternative.

In this paper, we studied the ability of a GA to select features from a large feature space. Our
goal was to introduce a versatile feature selection mechanism. The effectiveness and the
versatility of the GA were demonstrated by its application to the problem of classification of
masses and normal tissue on mammograms. The feature space included local and global multiresolution
texture features as well as morphological features. The rest of the paper is organized as follows.
In the next section, we briefly discuss important components of a GA. In Sec. III we describe our
image database, background correction method, extraction of texture and morphological features, and
the GA implementation for feature selection. In Sec. IV, we evaluate the dependence of the
classification results on different GA parameters. Section V contains a discussion of these
results. Finally, Sec. VI concludes the investigation and provides a scope for further research.


II.  GENETIC  ALGORITHMS 


In natural evolution, the basic problem of each population is to find beneficial adaptations to a
complex environment. The characteristics that each individual has gained or inherited are carried
in its chromosomes, and each individual reproduces more or less in proportion to its fitness within
the environment. Crossover and mutation provide the possibility of evolution toward better-fit
individuals.

Genetic algorithms apply the principles of natural selection to machine learning. To solve an
optimization problem, a GA requires five components, which are analogous to components of natural
selection. These components are described below.

A.  Encoding 

Encoding is a way of representing the decision variables of the optimization problem in a string
of binary digits called a chromosome. If there are v decision variables in an optimization problem
and each decision variable is encoded as an n-digit binary number, then a chromosome is a string of
n×v binary digits. Each chromosome is a possible solution to the optimization problem.
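
As a small illustration of this encoding, the sketch below packs v decision variables, each an n-digit binary number, into one bit string of n×v digits and unpacks it again. The function names are hypothetical and the representation as a Python string is only one convenient choice.

```python
def encode(values, n_bits):
    """Concatenate v non-negative integers, each as an n_bits-digit binary number."""
    for value in values:
        if not 0 <= value < 2 ** n_bits:
            raise ValueError("value does not fit in n_bits binary digits")
    return "".join(format(value, "0{}b".format(n_bits)) for value in values)


def decode(chromosome, n_bits):
    """Split a chromosome of n_bits * v binary digits back into v integers."""
    if len(chromosome) % n_bits != 0:
        raise ValueError("chromosome length must be a multiple of n_bits")
    return [int(chromosome[i:i + n_bits], 2)
            for i in range(0, len(chromosome), n_bits)]


chromosome = encode([5, 0, 3], n_bits=3)   # v = 3 variables, n = 3 digits each
print(chromosome, decode(chromosome, n_bits=3))
```

For a feature selection problem, a natural special case is n = 1, with one bit per candidate feature indicating whether that feature is included.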

B.  Initial  population 

The initial population is a set of chromosomes offered as an initial solution or as a starting
point in the search for better chromosomes. The initial population must be large and diverse enough
to allow evolution toward better individuals.

In general, the population is initialized at random to a bit string of 0's and 1's. However, more
directed methods for finding the initial population can sometimes be used to improve convergence
time.

C.  Fitness function

The fitness function rates chromosomes (i.e., possible solutions) in terms of how good they are in
solving the optimization problem. It thus plays the role of the environment. The fitness function
returns a single value for each chromosome, which is then used to determine the probability that
this chromosome will be selected as a parent to generate new chromosomes. The fitness function is
the primary GA component in which a traditional GA is tailored to a specific problem.

Sahiner et al.: Feature selection by genetic algorithm

D.  Genetic operators

Genetic operators are applied probabilistically to chromosomes of a generation to produce a new
generation of chromosomes. Three basic operators are parent selection, crossover, and mutation. The
parent selection operation mimics the natural selection process by selecting which chromosomes will
be used to create a new generation, where the fittest chromosomes reproduce most often. The
crossover operation refers to the exchange of substrings of two chromosomes to generate two new
offspring. After parents are selected, and crossover generates two new chromosomes, the operation
of mutation is applied to each bit in the string. Mutation simply alters the binary value of the
bit when a random value generated for the bit is less than a predefined mutation rate.
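
The three operators can be sketched as follows. The text does not fix the selection scheme or the number of crossover points, so fitness-proportionate (roulette-wheel) selection and single-point crossover are assumptions made here for illustration; chromosomes are represented as lists of 0/1 integers and all function names are hypothetical.

```python
import random


def select_parent(population, fitnesses, rng):
    """Roulette-wheel selection: the chance of being picked is proportional to fitness.

    Fitness values must be non-negative and not all zero.
    """
    return rng.choices(population, weights=fitnesses, k=1)[0]


def crossover(parent_a, parent_b, rng):
    """Single-point crossover: exchange the tails of two equal-length chromosomes."""
    point = rng.randint(1, len(parent_a) - 1)
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


def mutate(chromosome, mutation_rate, rng):
    """Flip each bit whose random draw falls below the mutation rate."""
    return [1 - bit if rng.random() < mutation_rate else bit for bit in chromosome]


rng = random.Random(0)
zeros, ones = [0] * 8, [1] * 8
child_a, child_b = crossover(zeros, ones, rng)
print(child_a, child_b, mutate(child_a, 0.1, rng))
```

Passing an explicit `random.Random` instance keeps the sketch reproducible for a given seed.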

E.  Working  parameters 

A set of parameters, which includes the number of chromosomes in each generation, the crossover
rate, the mutation rate, and the stopping criterion, is predefined to guide the GA. The crossover
and mutation rates, assigned as real numbers between 0 and 1, are used as thresholds to determine
whether the operators will be applied or not. The stopping criterion is predefined as the number of
generations the algorithm is to be run or as a tolerance value for the fitness function.

Two forces, exploration and exploitation, interact in the search for better-fit chromosomes.
Exploitation occurs in the form of parent selection. Chromosomes with higher fitness exploit this
fitness by reproducing more often. Exploration occurs in the form of mutation and crossover, which
allow the offspring to achieve a higher fitness than their parents. Crossover is the key to
exploration, whereas mutation provides background variation and occasionally introduces beneficial
genes into the chromosomes. For a successful GA, exploration and exploitation have to be in good
balance. With too much exploitation, the GA may be stuck with copies of the same chromosome after a
few generations, whereas with too much exploration, good genes may never be able to accumulate in
the genetic pool.

GAs are ideal for sampling large search spaces and locating the regions of enhanced opportunity.
Although GAs yield near-optimal solutions rather than optimal ones, obtaining such near-optimal
solutions is usually the best that one can do in many complex optimization problems involving large
numbers of parameters.
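
Putting the five components together, a minimal GA loop might look like the sketch below. It is not the implementation used in this study: the fitness function here is a toy stand-in (the number of 1 bits), roulette-wheel selection and single-point crossover are assumed, the parameter defaults are arbitrary, and the stopping criterion is a fixed number of generations.

```python
import random


def run_ga(fitness, n_bits, pop_size=20, crossover_rate=0.8,
           mutation_rate=0.01, n_generations=50, seed=0):
    """Return the fittest chromosome seen; fitness must be non-negative."""
    rng = random.Random(seed)
    # Initial population: random bit strings of 0's and 1's
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(n_generations):
        # Small offset keeps the selection weights valid if every score is zero
        weights = [fitness(c) + 1e-9 for c in population]
        next_generation = []
        while len(next_generation) < pop_size:
            # Exploitation: fitter chromosomes are chosen as parents more often
            a, b = (p[:] for p in rng.choices(population, weights=weights, k=2))
            # Exploration: crossover applied when the draw is below the crossover rate
            if rng.random() < crossover_rate:
                point = rng.randint(1, n_bits - 1)
                a, b = a[:point] + b[point:], b[:point] + a[point:]
            for child in (a, b):
                # Bitwise mutation, thresholded by the mutation rate
                child = [1 - bit if rng.random() < mutation_rate else bit
                         for bit in child]
                next_generation.append(child)
        population = next_generation[:pop_size]
        best = max(population + [best], key=fitness)
    return best


best = run_ga(fitness=sum, n_bits=16)
print(best, sum(best))
```

For feature selection, each bit would flag one candidate feature, and `fitness` would be replaced by a measure of classifier performance on the selected subset, such as the area under the ROC curve mentioned in the abstract.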

III.  METHODS 
A.  Data  set 

The mammograms used in this study were randomly selected from the files of patients who had
undergone biopsy in the Department of Radiology at the University of Michigan. The criterion for
inclusion of a mammogram in the data set was that the mammogram contained a biopsy-proven mass. To
avoid the effect of repetitive grid lines on the image texture, mammograms that contained grid
lines caused by the stationary grid of some older mammographic units were excluded. The data set
included 168 mammograms, with a mixture of benign (n=85) and malignant (n=83) masses. The
visibility of the masses was ranked by an experienced breast radiologist on a scale of 1 to 10,
where a ranking of 1 corresponded to the most visible category. The distribution of the visibility
ranking of the masses is shown in Fig. 1. It can be observed that the visibility of the masses in
our data set ranged from subtle to obvious.

Fig. 1. The distribution of the visibility ranking of the masses in the data set.

The mammograms were digitized with a LUMISYS DIS-1000 laser scanner at a pixel resolution of
100 µm × 100 µm and 4096 gray levels. The digitizer was calibrated so that gray level values were
linearly and inversely proportional to the optical density (OD) within the range of 0.1 to 2.8 OD
units, with a slope of -0.001 OD/pixel value. Outside this range, the slope of the calibration
curve decreased gradually. The OD range of the digitizer was 0 to 3.5.
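
Within the linear range, the calibration amounts to a straight line with the stated slope. The intercept is not given in the text, so the sketch below takes a reference point (`pixel_ref`, `od_ref`) as explicit, hypothetical arguments rather than guessing one; the function name is also an assumption.

```python
def gray_to_od(pixel, pixel_ref, od_ref, slope=-0.001, od_min=0.1, od_max=2.8):
    """Convert a gray level to optical density on the linear part of the curve.

    pixel_ref, od_ref: one known calibration point (not given in the text;
    it must be supplied by the user).
    """
    od = od_ref + slope * (pixel - pixel_ref)
    if not od_min <= od <= od_max:
        # Outside 0.1 to 2.8 OD the calibration curve is no longer linear
        raise ValueError("optical density outside the linear calibration range")
    return od


# Example with an invented reference point: gray level 1000 at 2.0 OD
print(gray_to_od(1500, pixel_ref=1000, od_ref=2.0))
```

An increase of 500 gray levels thus corresponds to a decrease of 0.5 OD, consistent with the inverse relationship described above.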

Four different ROIs, each with 256×256 pixels, were selected from each mammogram. One of the
selected ROIs contained the true mass as identified by an experienced radiologist and verified by
biopsy. In addition to the ROI that contained the true mass location, the radiologist in the study
was asked to select three presumably normal ROIs from the mammogram. The first of these three ROIs
contained primarily dense tissue which could mimic a mass lesion, the second ROI contained mixed
dense/fatty tissue, and the third contained mainly fatty tissue. An example of each of these ROIs
is shown in Fig. 2.

Fig. 2. An example of the mass and normal ROIs selected from one of the mammograms used in this
study. The four ROIs are upper left: mass; upper right: mixed dense/fatty tissue; lower left: dense
tissue; lower right: fatty tissue.

B.  Background correction

Breast masses are superimposed on structured background tissue in the ROIs. In most cases, this
background tissue is not uniform over our 256×256 pixel ROI. For example, one side of the ROI may
contain denser tissue than the other side, or, when the mass is close to the outer edge of the
breast, one corner of the ROI may contain a nonbreast region. This nonuniformity may affect texture
and morphological features that are extracted from the ROI. To reduce this effect, we developed a
correction method that estimated the low-frequency background level based on the image intensities
in a band of pixels surrounding the ROI. The background level at each pixel on the edge of the ROI
was first estimated by gray-level averaging in a rectangular region surrounding the pixel. The
background level of a pixel inside the ROI was then estimated by interpolation using the background
pixel values on the edges. A more detailed description of this background correction method can be
found in the literature.

C.  Feature  extraction 
1.  Texture  features 

The texture features used in this study were calculated from
spatial gray-level dependence (SGLD) matrices. The (i,j)th
element of an SGLD matrix is the joint probability that gray
levels i and j occur in a direction θ at a distance of d pixels
apart in an image. We computed global texture features, which
represent the average texture measures throughout the entire ROI,
and local texture features, which represent (i) the texture
measure of a denser subregion inside the ROI which is likely to
contain the mass, and (ii) the texture difference between this
subregion and other peripheral regions in the ROI which contain
normal breast tissue. The methods used for the computation of
SGLD matrices and multiresolution texture analysis are explained
in full detail elsewhere. A brief description is given below.

A wavelet transform using the four-coefficient Daubechies wavelet
filter was applied to each ROI to decompose the image into a
low-pass image and three high-pass subband images. For extracting
global multiresolution texture features, we used the original
image (scale=1) and the low-pass images at scales 2 and 4 to
formulate SGLD matrices at d=1 in the transformed images. The
distance of d=1 at these scales was equivalent to distances of 1,
2, and 4 in the original image. The wavelet coefficients at scale
8 were obtained with wavelet filtering but without down-sampling.
The coefficients at scale 8 were used to formulate SGLD matrices
at d=2,3,4,...,12. Since no down-sampling was used at scale 8,
these distances between pixel pairs were equivalent to distances
of 8,12,16,...,48 in the original image. Thus a total of 14
distances were used. At each distance, four SGLD matrices at
θ=0°, 45°, 90°, and 135° were determined. Thirteen texture
features were calculated from each SGLD matrix. The features at
θ=0°, 90° and at θ=45°, 135° were averaged separately. Thus 26
texture features were computed for each d, resulting in a total
of 364 global features.
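As an illustration of the SGLD construction described above, the following is a minimal Python sketch, not the implementation used in the study. The function name `sgld_matrix`, the symmetric counting of each pixel pair, and the assumption that gray levels are already quantized to integers in `0..levels-1` are choices made here for the example. The last two lines only re-derive the feature count stated in the text (14 distances, 13 features, two averaged direction groups).

```python
def sgld_matrix(image, d, angle_deg, levels):
    """Normalized, symmetric SGLD (co-occurrence) matrix for one (d, angle).

    image: list of rows of integer gray levels in 0..levels-1 (assumed
    already quantized; the quantization step is not shown here).
    """
    # (row, col) offsets for the four directions named in the text.
    offsets = {0: (0, d), 45: (-d, d), 90: (-d, 0), 135: (-d, -d)}
    dr, dc = offsets[angle_deg]
    rows, cols = len(image), len(image[0])
    counts = [[0] * levels for _ in range(levels)]
    total = 0
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                counts[i][j] += 1  # count the pair in both orders so the
                counts[j][i] += 1  # matrix is symmetric (an assumption)
                total += 2
    if total == 0:
        raise ValueError("distance d is too large for this image")
    return [[n / total for n in row] for row in counts]


# Bookkeeping from the text: d=1 at scales 1, 2, 4, plus d=2..12 at scale 8
# (each equivalent to 4*d pixels in the original image).
distances_original = [1, 2, 4] + [4 * d for d in range(2, 13)]
n_global = len(distances_original) * 13 * 2  # 13 features, 2 direction groups
```

Texture measures (the thirteen features mentioned above) would then be computed from each returned matrix; they are not reproduced here because the text does not list them.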

Fig. 3. A schematic of the clustering algorithm.

For extracting local texture features, five subregions were
automatically identified in the background-corrected ROI: a 90×90
pixel object subregion that contained the suspicious dense tissue
or the mass, and four 64×64 pixel peripheral subregions that were
located in the four corners of the ROI. The suspicious object
subregion was automatically detected by searching for the highest
average gray level inside the ROI using a 90×90 moving box. For a
given d, an SGLD matrix was derived from the object subregion,
and a background SGLD matrix was derived from the pixel pairs in
the four peripheral subregions. The SGLD matrices were computed
at d=1, 2, 4, and 8. Analogous to the global texture feature
extraction, for a given d, 26 features were computed from the
object SGLD matrix, and 26 features were computed from the
background SGLD matrix. The local texture feature space therefore
consisted of 104 features extracted from the object subregion,
and 104 features calculated as the differences between the
corresponding features extracted from the object and peripheral
subregions. This resulted in a total of 208 local texture
features.
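The moving-box search for the object subregion can be sketched as follows. This is a hedged illustration only: the function name `find_object_subregion`, the use of a summed-area table, and the rule that the first maximum wins on ties are assumptions made for the example, not details given in the text.

```python
def find_object_subregion(roi, box):
    """Return (row, col) of the top-left corner of the box-by-box window
    with the highest mean gray level in roi (a list of rows)."""
    rows, cols = len(roi), len(roi[0])
    # sat[r][c] is the sum of roi[0:r][0:c]; the extra zero row and column
    # remove the need for boundary checks.
    sat = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            sat[r + 1][c + 1] = (roi[r][c] + sat[r][c + 1]
                                 + sat[r + 1][c] - sat[r][c])
    best_sum, best_pos = None, None
    for r in range(rows - box + 1):
        for c in range(cols - box + 1):
            window = (sat[r + box][c + box] - sat[r][c + box]
                      - sat[r + box][c] + sat[r][c])
            # Equal window size everywhere, so the largest sum is also
            # the largest mean.
            if best_sum is None or window > best_sum:
                best_sum, best_pos = window, (r, c)
    return best_pos
```

For the ROIs described here, a call such as `find_object_subregion(roi, 90)` on a 256×256 background-corrected ROI would return the corner of the 90×90 object subregion.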

The detail images in the wavelet transform can be expected to
contain useful information for texture-based classification of a
large class of images. However, in our previous studies, we found
that using the SGLD texture features based on the detail images
in the wavelet transform domain did not result in proper
classification of breast masses and normal breast tissue. Since
this study focused on the feature selection aspect of
classification, we did not attempt to search for new texture
features that are presumably present in the detail images.

2.  Morphological  features 

We have developed an automated algorithm for segmentation of an
ROI into an object region and background tissue. The
morphological features are extracted automatically from the
object region after the segmentation is performed.

We used a pixel-by-pixel clustering algorithm followed by binary
object detection for ROI segmentation. Pixel-by-pixel clustering
algorithms have found widespread use in segmentation of remote
sensing data, where multispectral and/or multisource data are
obtained for each pixel in the image. Data points for each pixel
are regarded as components of a multidimensional feature vector,
and pixels with feature vectors of similar characteristics are
assigned to the same class using a clustering algorithm. Our data
set contains a single data point (the gray level) for each pixel.
We derived several filtered images from this single image, and
used the original and filtered pixel values as the components of
the feature vectors in the clustering algorithm. Inclusion of the
filtered images makes it possible to incorporate neighborhood
information into the classification of each pixel.

Our clustering algorithm, depicted in Fig. 3, is very similar to
the migrating means algorithm. The goal is to classify pixel p_i
as either an object or a background pixel. This is achieved by
clustering with feature vector F_i=[f_i(1),...,f_i(L)] of length
L, where L is the total number of images used in clustering. The
algorithm starts by choosing initial cluster center vectors, for
the object and the background, as described below. Let
C_o=[c_o(1),...,c_o(L)] and C_b=[c_b(1),...,c_b(L)] denote these
cluster center vectors, respectively. Let d_o(i) denote the
Euclidean distance between F_i and C_o, d_b(i) denote the
Euclidean distance between F_i and C_b, and R denote a constant
distance ratio. If d_b(i)/d_o(i)>R, the pixel p_i is temporarily
classified as an object pixel; otherwise, it is classified as a
background pixel. If R=1, the algorithm becomes identical to the
migrating means algorithm. After this temporary classification,
two new cluster center vectors are computed. The lth components
of the new object and background center vectors are the averages
of the lth components for pixels temporarily classified as object
and background pixels, respectively. If the new cluster centers
are different from the previous ones, the procedure of temporary
classification is repeated; otherwise, the clustering is
completed. In this paper, we used R=2.15 so that F_i had to be
much closer to C_o than to C_b to be classified as an object
pixel. This conservative criterion reduces the chance that a mass
region merges with adjacent tissue. However, it also slightly
underestimates the mass size so that the detected edge is often
within the margin of the mass. The initial center vectors were
chosen such that each component of the initial object center
vector is 1.1 times the average of that component over the entire
ROI, and each component of the initial background center vector
is 0.9 times the same average.
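The clustering loop described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' code: the function name `cluster_pixels`, the `max_iter` safety cap, and stopping when the labels stop changing (which implies that the recomputed centers no longer change) are choices made here. The ratio test is written as a product so that a zero distance to the object center causes no division error.

```python
import math


def cluster_pixels(features, ratio, max_iter=100):
    """Two-class clustering with a distance-ratio rule.

    features: one feature vector (length L) per pixel.
    Returns a list of booleans, True where the pixel is labelled object.
    """
    n, length = len(features), len(features[0])
    mean = [sum(f[k] for f in features) / n for k in range(length)]
    c_obj = [1.1 * m for m in mean]  # initial object center
    c_bkg = [0.9 * m for m in mean]  # initial background center
    labels = None
    for _ in range(max_iter):  # max_iter is a safety cap (assumption)
        # Object only if the background center is more than `ratio` times
        # farther away than the object center.
        new_labels = [math.dist(f, c_bkg) > ratio * math.dist(f, c_obj)
                      for f in features]
        if new_labels == labels:
            break  # same labels, hence same centers: clustering is complete
        labels = new_labels
        obj = [f for f, is_obj in zip(features, labels) if is_obj]
        bkg = [f for f, is_obj in zip(features, labels) if not is_obj]
        if obj:
            c_obj = [sum(f[k] for f in obj) / len(obj) for k in range(length)]
        if bkg:
            c_bkg = [sum(f[k] for f in bkg) / len(bkg) for k in range(length)]
    return labels
```

With `ratio=1` the rule reduces to nearest-center assignment, which matches the remark that the algorithm then coincides with migrating means.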

After clustering, the ROI may contain several disconnected
objects. To obtain a single suspected mass object, we selected
the largest connected object among all detected objects. We
finally applied region growing to a small region outside the
boundary of the suspected object to get a better definition of
its borders. To achieve this, we thresholded the original image
pixels that were within ten pixels of the object border. The
threshold value was chosen experimentally to be the difference
between the mean of the pixel values inside the object and half
of their standard deviation. Figure 4 shows an example of the
result of our segmentation algorithm.

In this paper, we used three filtered images along with the
original image to form the feature vectors. The first filtered
image was obtained by median filtering with a 5×5 kernel. The
second and third filtered images were edge-enhanced images at
different resolutions. Each filtered image, as well as the
original image, was linearly normalized between 0 and S_l, where
S_l, l=1,...,L, is a scaling factor. The scaling factors S_l were
chosen experimentally to be S_1=S_2=1400 for the original image
and the median filtered image, and S_3=S_4=[illegible] for the
edge-enhanced images. Therefore, the original image and the
median filtered image were weighted approximately twice as much
as the edge-enhanced images in the clustering algorithm. This
bias in favor of the weights of the original image and the median
filtered image was necessary because the algorithm showed a
tendency to segment only disconnected edges if all images were
equally weighted.

Fig. 4. (a) An ROI with an ill-defined mass, (b) mass object
extracted automatically by the clustering algorithm and
superimposed on the background-corrected ROI.

After detection of a single suspicious object within each ROI,
features were extracted from the object and its margins. We
extracted eleven shape features from each object, and four
features from the margins of each object. The shape features
included the number of edge pixels, area, circularity,
rectangularity, contrast, the ratio of the number of edge pixels
to the area, and five normalized radial length features. A
detailed discussion of the shape features used in this study can
be found in Ref. 35. The margin features were computed as
follows. First, the mean and the standard deviation of the pixel
values inside the object were computed. Next, pixels in a
boundary region outside the object but within a distance of 15
pixels from the object border were thresholded. The values of the
thresholds were chosen to be the mean minus 0.5, 1, 1.5, and 2
times the standard deviation. The numbers of pixels in the
boundary region which were above each of these thresholds were
defined as the margin features. Thus a total of 15 morphological
features were extracted from each ROI.


D.  Classifiers 

In this paper, we investigated GA-based feature selection for two
kinds of classifiers, namely (i) a linear classifier based on
Fisher's linear discriminant and (ii) a multilayer
backpropagation neural network (BPN). For each ROI, both
classifiers produced a scalar, termed the classifier output,
which indicated the likelihood that the ROI contained a real
mass.


Fisher's linear discriminant is based on a linear projection of
the feature space onto the real line such that the ratio of the
between-class sum of squares to within-class sum of squares is
maximized after the projection. In our two-class problem, the
statistical procedure for formulation of the linear discriminant
function is equivalent to multiple linear regression. Fisher's
linear discriminant is the optimal classifier if the features are
distributed as multivariate Gaussian random variables with equal
covariance matrices under each class.
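For readers who want to see the projection concretely, the following is a minimal pure-Python sketch of the textbook two-class Fisher direction, w = Sw^-1 (mean1 - mean0), where Sw is the pooled within-class scatter matrix. It is not the statistical package or code used in the study; all function names are hypothetical, and the small Gaussian-elimination solver is included only to keep the example self-contained.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def class_mean(samples):
    n = len(samples)
    return [sum(s[k] for s in samples) / n for k in range(len(samples[0]))]


def fisher_direction(class0, class1):
    """Projection weights w = Sw^-1 (mean1 - mean0)."""
    k = len(class0[0])
    m0, m1 = class_mean(class0), class_mean(class1)
    sw = [[0.0] * k for _ in range(k)]  # pooled within-class scatter
    for samples, mean in ((class0, m0), (class1, m1)):
        for s in samples:
            dev = [s[i] - mean[i] for i in range(k)]
            for i in range(k):
                for j in range(k):
                    sw[i][j] += dev[i] * dev[j]
    return solve(sw, [m1[i] - m0[i] for i in range(k)])


def classifier_output(w, x):
    """Scalar projection of feature vector x onto w."""
    return sum(wi * xi for wi, xi in zip(w, x))
```

The scalar returned by `classifier_output` plays the role of the classifier output described above; a decision threshold on it is what is varied when an ROC curve is built.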

The BPN used in this study consisted of an input layer, an output
layer, and a single hidden layer. Each layer in the BPN contained
a number of nodes, which were connected to previous and
subsequent layers by trainable weights. A single feature was
applied to each node in the input layer. The net input to each
node in the hidden layer and the output layer was a weighted sum
of the node outputs from the previous layer. The output of a node
was related to its net input by a sigmoidal function. The output
layer contained a single node, whose output indicated the
likelihood that the ROI contained breast mass tissue. The BPN was
trained using batch processing and the delta-bar-delta rule for
improved rate of convergence and stability.

Since our purpose in this study is to design a feature selection
algorithm, we did not compare BPN and linear discriminant
classifiers. Instead, we compared the classification accuracy
obtained by using different feature selection methods, with a
fixed classifier for each comparison.


E.  GA-based  feature  selection 

In this paper, we used a GA to select features for discrimination
of mass and nonmass ROIs. In our GA, the number of bits in a
chromosome was equal to the total number of available features,
and each bit corresponded to an individual feature extracted from
the ROIs. A feature was termed "present" in a chromosome if the
value of the bit corresponding to that feature was 1. The
population was initialized at random, with a small probability
P_init of having a 1 at each bit location. This allowed the GA to
start with a few selected features and grow to a reasonable
number of features as the population evolved. The total number of
chromosomes at each generation was kept constant at M = 250.
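The chromosome encoding and random initialization can be sketched as below. This is an illustration only: the function name `init_population` is hypothetical, the chromosome length of 587 assumes that all 364 global texture, 208 local texture, and 15 morphological features described earlier are offered to the GA, and the value 0.002 for the initial bit probability is used here simply as an example.

```python
import random

N_FEATURES = 364 + 208 + 15  # global + local texture + morphological = 587
POP_SIZE = 250               # M in the text


def init_population(p_init, n_features=N_FEATURES, pop_size=POP_SIZE, rng=None):
    """Random bit strings; each bit is 1 with probability p_init."""
    rng = rng or random.Random()
    return [[1 if rng.random() < p_init else 0 for _ in range(n_features)]
            for _ in range(pop_size)]


population = init_population(0.002, rng=random.Random(0))
# Expected number of features present in an initial chromosome.
expected_features = 0.002 * N_FEATURES
```

With these assumptions the expected number of features per initial chromosome is a little over one, which is what lets the population start small and grow.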

At each run of the GA, the image data set of 672 ROIs was divided
into a training and a test set, with ROIs belonging to the same
film grouped into the same set. The training set was used in the
GA for feature selection. After feature selection, a classifier
was trained using only the GA-selected features of the training
set. The classification accuracy of the procedure was evaluated
by applying the classifier to the same set of features of the
test group, as described below. For studying the effect of GA
parameters on the classification accuracy with the linear
discriminant classifier, ten random partitionings of training and
test sets were obtained for each set of different GA parameters,
and the results were averaged in order to reduce the effect of
case selection. For experiments with the BPN, 50 random
partitionings were used. For both experiments, the number of mass
and nonmass ROIs in each training set was 126 and 378 (3/4 of the
total), respectively, while the number of mass and nonmass ROIs
in each test set was 42 and 126 (1/4 of the total), respectively.

Inside the GA, the training set was equally divided into two
groups, S1 and S2. For each chromosome, two classifiers were
trained, with S1 and S2 as the training groups, respectively.
Only the features present in the chromosome were used as features
in classifier training. The classifier trained on group S1 was
applied to the group S2, and vice versa, for calculation of two
sets of pseudotest classifier outputs. The accuracy of the
pseudotest classifier outputs, and the number of selected
features were then used to define the fitness of the individual
chromosome. This process was repeated for each of the M
chromosomes in each generation.

The main component of the fitness function was the area A_z under
the receiver operating characteristic (ROC) curve of the
pseudotest sets. A widely accepted procedure for computing the
ROC curve assumes that the classifier output follows a normal
distribution for each class, and fits the ROC curve to the
classifier output using maximum likelihood estimation. We adopted
this approach when we studied and compared the classification
accuracy of our classifiers with the selected feature sets.
However, it is computationally expensive to use this approach in
the fitness function calculation inside the GA, because it is
required for each chromosome in each generation. Instead, we
chose to estimate the ROC curve by varying the decision
threshold, and determining the true-positive fraction (TPF) as a
function of the false-positive fraction (FPF). The A_z value was
estimated by numerical integration using the trapezoidal rule.
Since this estimation of A_z was internal to the GA, it did not
affect the A_z values reported in Sec. IV for a set of selected
features. Internal to the GA, the fitness ranking of the
chromosomes might be slightly different from that obtained by
using the maximum likelihood ROC curve. However, the effect on
the final selected feature set should be small, because this
slight difference did not completely eliminate the lower-ranking
chromosomes. A slightly lower-ranking chromosome was assigned a
slightly lower probability of being a parent, but it could still
be competitive after mutation and crossover if it contained
effective features. This minor inaccuracy in the fitness function
computation was a trade-off in order to execute the computation
in a reasonable amount of time while using the A_z value in the
feature selection procedure.
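The threshold-sweeping estimate of the ROC area can be sketched as follows. This is a minimal illustration and not the code used in the study: the function name `roc_area_trapezoid` is hypothetical, and the convention that a score greater than or equal to the threshold is called positive is an assumption.

```python
def roc_area_trapezoid(pos_scores, neg_scores):
    """Area under the empirical ROC curve by the trapezoidal rule.

    pos_scores: classifier outputs for mass ROIs.
    neg_scores: classifier outputs for nonmass ROIs.
    """
    # Every distinct output value serves as a decision threshold,
    # swept from the strictest to the most lenient.
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    points = [(0.0, 0.0)]  # (FPF, TPF) when nothing is called positive
    for t in thresholds:
        tpf = sum(s >= t for s in pos_scores) / len(pos_scores)
        fpf = sum(s >= t for s in neg_scores) / len(neg_scores)
        points.append((fpf, tpf))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0  # one trapezoid per segment
    return area
```

Because it needs only a sort and a few counts, an estimate of this kind is cheap enough to evaluate for every chromosome in every generation, which is the trade-off discussed above.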

A second component of the fitness function was a penalty term,
analogous to Brill's utility term, which was linearly
proportional to the number of features present in the chromosome.
The purpose of this penalty term was to control the number of
selected features and to prevent overfitting in the test stage of
classifier design. In other words, the penalty term was designed
to improve the classification accuracy, and not for accelerating
the computational speed. The function of the penalty term was
comparable to those of the F-to-enter and F-to-remove thresholds
in the stepwise feature selection method, described in the next
subsection. Similar to these corresponding parameters in stepwise
feature selection, increasing the penalty term decreased the
number of selected features. We studied the effect of the
presence of this penalty term on the test results.

In a given generation, the fitness function f(m) for a chromosome
m was computed as follows. First, the two pseudotest A_z values,
corresponding to pseudotest sets S1 and S2, were averaged to
yield A_z(m). Next, a fitness function f'(m) was computed as

$$f'(m) = A_z(m) - \alpha N(m), \qquad (1)$$

where N(m) was the number of 1's (present features) in chromosome
m and α was the penalty constant. After f'(m) was determined for
all chromosomes, the maximum f'_max and the minimum f'_min of
f'(m) over the population of M chromosomes were calculated.
Finally, f'(m) was normalized using f'_max and f'_min to yield
the fitness function f(m),

$$f(m) = \left( \frac{f'(m) - f'_{\min}}{f'_{\max} - f'_{\min}} \right), \qquad 1 \le m \le M. \qquad (2)$$
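The two-step fitness computation (penalized A_z, then min-max normalization over the population) can be sketched as below. This is a hedged illustration: the function names are hypothetical, the normalization is taken to be a plain linear rescaling to [0, 1], and the handling of a population in which every chromosome has the same raw value is an assumption, since the text does not discuss that case.

```python
def raw_fitness(az_s1, az_s2, chromosome, alpha):
    """Average pseudotest A_z minus a penalty proportional to the
    number of features present (bits equal to 1)."""
    az_mean = (az_s1 + az_s2) / 2.0
    return az_mean - alpha * sum(chromosome)


def normalize_fitness(raw):
    """Rescale raw fitness values linearly to the range [0, 1]."""
    lo, hi = min(raw), max(raw)
    if hi == lo:
        # Assumption: a degenerate population gets equal fitness.
        return [1.0] * len(raw)
    return [(f - lo) / (hi - lo) for f in raw]
```

Note that the lowest-ranked chromosome receives a normalized fitness of exactly zero under this rescaling, so it has no area on the roulette wheel in that generation.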


The genetic operators were applied as follows. First, parent
selection was performed using roulette wheel selection. In this
method, each chromosome in a generation occupies an area

$$A(m) = \frac{f(m)}{\sum_{k=1}^{M} f(k)}, \qquad (3)$$

proportional to its fitness, on a roulette wheel. A parent is
selected by spinning the roulette wheel, i.e., by generating a
random number r_i ∈ (0,1] and determining the chromosome m_i that
satisfies

$$\sum_{m=1}^{m_i - 1} A(m) < r_i \le \sum_{m=1}^{m_i} A(m), \qquad i = 1, 2. \qquad (4)$$
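Roulette wheel selection as described above amounts to locating a uniform random number in the cumulative sum of the normalized fitness values. The sketch below is illustrative only: the function name `roulette_select` is hypothetical, at least one chromosome is assumed to have positive fitness, and the final fallback line exists only to guard against floating-point rounding in the last cumulative value.

```python
import random
from itertools import accumulate


def roulette_select(fitness, rng):
    """Return the index of the selected parent.

    fitness: nonnegative fitness values, at least one of them positive.
    rng: a random.Random instance.
    """
    total = sum(fitness)
    cumulative = list(accumulate(f / total for f in fitness))
    r = 1.0 - rng.random()  # uniform on (0, 1]
    for index, edge in enumerate(cumulative):
        if r <= edge:
            return index
    # Rounding guard: fall back to the last chromosome with positive area.
    return max(i for i, f in enumerate(fitness) if f > 0)
```

Calling the function twice yields the two parents; a chromosome with zero fitness occupies no area and is never returned by the main loop.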


After two parents m_1 and m_2 were selected for generating two
offspring, a probabilistic decision was made as to whether
crossover should be applied or not. A random number β with
uniform distribution in the interval (0,1] was generated and
compared to P_c, the probability of crossover. If β>P_c, then no
crossover was applied, and m_1 and m_2 were accepted into the new
generation. Otherwise, a random crossover site was selected
inside the chromosomes, and each of the parent chromosomes was
split into left and right strings at this location. Crossover was
completed by combining the left string of m_1 with the right
string of m_2, and vice versa.

Finally, mutation was applied to each bit of the chromosomes in
the new generation. Again, a random number with uniform
distribution in the interval (0,1] was generated, and compared to
P_m, the probability of mutation. If P_m was higher, then the bit
was complemented. Otherwise, it was left unchanged. We studied
the effects of P_c and P_m on the final classification accuracy.
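Single-point crossover and bitwise mutation, as described in the two paragraphs above, can be sketched as follows. The function names `crossover` and `mutate` are hypothetical, and restricting the crossover site to positions strictly inside the string (so that both parents always contribute) is an assumption.

```python
import random


def crossover(parent1, parent2, p_c, rng):
    """Return two offspring; crossover is applied with probability p_c."""
    beta = 1.0 - rng.random()  # uniform on (0, 1]
    if beta > p_c:
        return parent1[:], parent2[:]  # parents pass through unchanged
    site = rng.randrange(1, len(parent1))  # split point inside the string
    return (parent1[:site] + parent2[site:],
            parent2[:site] + parent1[site:])


def mutate(chromosome, p_m, rng):
    """Complement each bit independently with probability p_m."""
    return [1 - bit if p_m > 1.0 - rng.random() else bit
            for bit in chromosome]
```

With a mutation probability of 0.001 and a few hundred bits per chromosome, fewer than one bit per chromosome is flipped per generation on average.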

The GA was permitted to evolve for a fixed number of generations.
After the evolution was completed, the chromosome with the
highest fitness value provided the set of selected features. The
entire training set S1 ∪ S2 was then used in the final multiple
linear regression to determine the weight of each selected
feature in the classifier. During testing, the values of the
selected features of each ROI in the test set were applied as
inputs to the trained classifier to calculate the classifier
output for that ROI.

To evaluate the classification performance, the classifier output
was used as the decision variable, and a test ROC curve was
estimated using the LABROC1 program. The LABROC1 program assumes
binormal distributions of the decision variable for the normal
and abnormal cases, and fits the ROC curve based on maximum
likelihood estimation. The area under the fitted ROC curve, A_z,
was used as an index of classification accuracy.

F.  Stepwise  feature  selection 

For the purpose of comparison with GA-based feature selection, we
also studied the classification accuracy of the same classifiers
using a well-established feature selection method, called feature
selection with stepwise linear discriminant analysis, or stepwise
feature selection in short. At each step of the stepwise
selection procedure, one feature is entered into or removed from
the selected feature pool by analyzing its effect on a selection
criterion. In this study, we employed Wilks' lambda as our
selection criterion, which is defined as the ratio of the
within-group sum of squares to the total sum of squares of the
two classes. The number of features selected by this method is
controlled by two parameters, called F-to-enter and F-to-remove.
At each step, the stepwise feature selection algorithm first
determines the significance of the change, based on F statistics,
in Wilks' lambda when a variable is entered into the selected
feature pool. If the significance is above the threshold
determined by the F-to-enter parameter, then the selected feature
pool is augmented with the most significant variable. Next, the
algorithm computes the significance of the change in Wilks'
lambda when each variable is removed from the selected feature
pool. If the significance is below the threshold determined by
the F-to-remove parameter, then the least significant variable is
removed from the selected feature pool. Increasing either the
F-to-enter or the F-to-remove value decreases the number of
selected features. Similar to GA-based feature selection,
stepwise feature selection is a heuristic procedure. For this
reason, the optimal values of the F-to-enter and F-to-remove
parameters are not known in advance. One has to experiment with
these parameters and increase or decrease the number of selected
features to obtain the best test performance. A detailed
description of the stepwise feature selection procedure and its
application to our problems can be found in the literature.
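To make the selection criterion concrete, the sketch below computes Wilks' lambda for a single feature and two classes, using the definition given above (within-group sum of squares divided by total sum of squares). It is a simplified, univariate illustration with a hypothetical function name; the stepwise procedure itself works with several features at once and with F statistics derived from changes in lambda, neither of which is shown here.

```python
def wilks_lambda_1d(class0, class1):
    """Wilks' lambda for one feature: within-group SS / total SS."""
    def sum_sq(values, center):
        return sum((v - center) ** 2 for v in values)

    mean0 = sum(class0) / len(class0)
    mean1 = sum(class1) / len(class1)
    pooled = list(class0) + list(class1)
    grand_mean = sum(pooled) / len(pooled)
    within = sum_sq(class0, mean0) + sum_sq(class1, mean1)
    total = sum_sq(pooled, grand_mean)
    return within / total
```

A value near 0 indicates that the feature separates the two classes well, and a value of 1 indicates no separation of the class means.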

IV.  RESULTS 

In the next two subsections, we present the results for
evaluation of the effects of various parameters, and for
classification with GA-based feature selection using linear
discriminant and BPN classifiers, respectively. Since training a
linear discriminant classifier was considerably faster than
training a BPN, the effects of GA parameters were studied with a
linear discriminant classifier. Feature selection for a BPN
classifier was performed on a subset of the entire feature set to
accelerate training. For both classifiers, a comparison with
stepwise feature selection was provided.

A.  Feature  selection  for  a  linear  discriminant  classifier 

1.  Effect  of  penalty  term  and  number  of 
generations 

To determine a reasonable number of generations for the GA to
evolve, we selected several combinations of crossover probability
(P_c) and mutation probability (P_m), and monitored the growth of
the number of selected features. The initial probability of
feature presence was fixed at P_init=0.002. The GA was allowed to
evolve with two different α values of the penalty term in the
fitness function of Eq. (1). We observed that the crossover
probability P_c did not have a major effect on the number of
selected features. However, both α in the penalty term and the
mutation probability P_m affected the number of selected
features. Figures 5 and 6 plot the average number of selected
features over ten training sets versus the generation number for
α=0 and α=1/2000, respectively. The average number of selected
features is plotted for P_m=0.001 and P_m=0.003 in each figure.
The crossover probability is kept constant at P_c=0.1. The test
A_z value obtained up to a given generation is plotted against
the generation number in Figs. 7 and 8 for the same conditions
(α=0 and α=1/2000), respectively. The average A_z value over ten
test sets is shown.

It is observed that while the average test A_z value does not
increase after the 25th generation, the number of selected
features keeps increasing beyond the 60th generation for all
combinations of GA parameters studied. Since the main component
of the fitness function in the GA is the A_z value rather than
the number of features, more features may be added into the
selected feature pool as long as the area under the ROC curve
does not deteriorate. Comparing Figs. 5 and 6, it can be observed
that the penalty term suppressed the number of selected features.
The number of selected features eventually leveled off at about
the 80th generation when the penalty term was nonzero (Fig. 6).

Fig. 5. Evolution of the number of selected features for α=0.

Fig. 6. Evolution of the number of selected features for
α=1/2000.

The average test A_z values at the end of 100 generations were
0.89 for the combinations studied in Fig. 8, and 0.88 for the
combinations studied in Fig. 7. The maximum and minimum values of
individual test scores for the ten partitions studied were 0.92
and 0.86 for Fig. 8, and 0.92 and 0.85 for Fig. 7. The standard
deviation of the individual A_z values, as determined by the
LABROC1 program, varied between 0.02 and 0.04.

Since our goal is to select a small number of features while
maintaining a high classification accuracy, we performed
subsequent GA experiments with α=1/2000. Due to computation time
constraints, we set the maximum number of generations to be 25 in
the following experiments.

Fig. 7. Evolution of the average test A_z for α=0.

Fig. 8. Evolution of the average test A_z for α=1/2000.

2. Effect of initial probability of feature presence (P_init)

We evaluated the effect of P_init on feature selection when the
crossover probability P_c and the mutation probability P_m were
held constant. The average test A_z values for P_c=0.9 and
P_m=0.001 are tabulated in Table I. It is observed that the
performance of the GA reaches a broad maximum when P_init is in
the range of 0.0005 to 0.020, i.e., when the average number of
features in the initial chromosomes is approximately in the range
of 0.3 to 12. When P_init is out of this range, the average test
A_z decreases slightly.

3.  Effect  of  probability  of  mutation  and  crossover 

The  effects  of  the  crossover  probability  P^  and  the  muta¬ 
tion  probability  Pm  on  the  classification  accuracy  are  sum¬ 
marized  in  Tables  II  and  III,  respectively.  In  Table  II,  the 
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Table  {.  The  effect  of  P  on  GA  performance  for  =0.001.  P  =0.9. 
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T.xBLi;  [II.  The  effect  of  on  GA  performance  for  ^,-=0.9 


Average  Test  Avg.  \um. 

of  features 


control parameters were fixed at Pinit=0.002 and Pm=0.001, while in
Table III, they were fixed at Pinit=0.002 and Pc=0.9. For fixed
values of Pinit and Pm, the average test Az appears to increase with
increasing Pc, while the number of selected features remains
relatively constant. On the other hand, for fixed values of Pinit and
Pc, the average test Az increases initially with increasing Pm,
reaching a maximum at Pm=0.001, and then decreases slightly as Pm
increases beyond 0.003. Although the variation of the classification
accuracy with respect to Pm is not significant, it appears that a
reasonable range of choice for Pm is such that the average number of
mutations per chromosome per generation is less than 1.5 (0.003 X the
number of genes per chromosome). Within the range studied, the number
of selected features increases with increasing Pm, which may be the
reason for the slight deterioration in performance for large Pm.
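
As an illustration of how Pc and Pm act on a chromosome, the following is a minimal sketch rather than the implementation used in this study. It assumes single-point crossover and independent bit-flip mutation, neither of which is specified in this section, and all function names are hypothetical.

```python
import random

def crossover(parent_a, parent_b, p_c, rng):
    """With probability p_c, swap the tails of two parents at a random cut."""
    if len(parent_a) > 1 and rng.random() < p_c:
        cut = rng.randrange(1, len(parent_a))
        return (parent_a[:cut] + parent_b[cut:],
                parent_b[:cut] + parent_a[cut:])
    return parent_a[:], parent_b[:]

def mutate(chromosome, p_m, rng):
    """Flip each gene independently with probability p_m."""
    return [1 - gene if rng.random() < p_m else gene for gene in chromosome]

if __name__ == "__main__":
    rng = random.Random(0)
    a = [1, 0, 0, 0, 1, 0, 0, 0]
    b = [0, 0, 1, 0, 0, 0, 0, 1]
    child_a, child_b = crossover(a, b, p_c=0.9, rng=rng)
    print(mutate(child_a, p_m=0.001, rng=rng))
    print(mutate(child_b, p_m=0.001, rng=rng))
```

With independent bit flips, the expected number of mutations per chromosome per generation is Pm times the number of genes. Because most genes are 0 in a sparse chromosome, most mutations switch a feature on, which is consistent with the observation that the number of selected features grows with Pm.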

4. Comparison with LDA classifier and random feature selection

We used a commercial statistics package, SPSS, for LDA
classification. The feature selection and formulation of the
discriminant function were performed on each of the ten training
sets, and the discriminant functions were tested on the corresponding
test sets. Using minimization of Wilks' lambda as the feature
selection criterion, we varied the two threshold values for F
statistics (F-to-enter and F-to-remove) in the SPSS package so that
the average test Az value over the ten partitionings was maximized.
The number of selected features and the test results for the ten
partitionings are tabulated in Table IV. We chose the best GA
classification results (the last line in Table II) for comparison
with those of the LDA. The corresponding test Az values and the
number of selected features for each partitioning of the data set are
tabulated in Table IV.


Table II. The effect of Pc on GA performance for Pinit=0.002, Pm=0.001.
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For  comparison  with  these  two  near-optimal  feature  selec¬ 
tion  methods,  we  performed  multiple  linear  regression  train¬ 
ing  and  testing  on  20  randomly  selected  features  out  of  the 
available 587 features. The test Az values based on these 20
randomly selected features are also given in Table IV.

B. Feature selection for BPN

Since training a BPN is considerably slower than training a linear
discriminant classifier, we modified our training strategy for this
classifier. The basic differences between the experiments in this
subsection on BPN and the previous subsection on linear discriminant
classifier were: (1) In order to handle a smaller feature pool with
BPN, we used a single distance for texture features. Based on our
previous study of the effects of pixel distance on classification, we
selected a pixel distance of d=20. The global texture features
computed at this pixel distance, plus the morphological features
previously described in Sec. III C, constituted the feature pool in
this subsection. Therefore, there were a total of 41 features (26
texture and 15 morphological) for the feature selection algorithms to
choose from. (2) In order not to repeat the feature selection process
several times with several different training sets, the entire data
set was used in the feature selection step of the classification
procedure. After feature selection was completed, the classifier was
trained and tested with 50 different partitionings of the data set
into training and test groups. As in the case of linear discriminant
classifier, the number of mass and nonmass ROIs in each training set
was 126 and 378 (3/4 of the total), respectively, while the number of
mass and nonmass ROIs in each test set was 42 and 126 (1/4 of the
total), respectively.
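
The partitioning described above can be sketched as follows. This is a minimal illustration, assuming that ROIs are drawn at random within each class and identified by integer indices (a hypothetical convention); the totals of 168 mass and 504 nonmass ROIs follow from the training and test counts given above.

```python
import random

def partition_rois(n_mass=168, n_nonmass=504, train_fraction=0.75, rng=None):
    """Randomly split each class into training (3/4) and test (1/4) ROIs."""
    rng = rng or random.Random()
    mass_ids = list(range(n_mass))
    nonmass_ids = list(range(n_nonmass))
    rng.shuffle(mass_ids)
    rng.shuffle(nonmass_ids)
    n_mass_train = round(n_mass * train_fraction)
    n_nonmass_train = round(n_nonmass * train_fraction)
    return {
        "train_mass": mass_ids[:n_mass_train],
        "test_mass": mass_ids[n_mass_train:],
        "train_nonmass": nonmass_ids[:n_nonmass_train],
        "test_nonmass": nonmass_ids[n_nonmass_train:],
    }

if __name__ == "__main__":
    # 50 partitionings, each with its own seed so that they are reproducible.
    partitions = [partition_rois(rng=random.Random(seed)) for seed in range(50)]
    first = partitions[0]
    print({name: len(ids) for name, ids in first.items()})
```

Because the 50 partitionings are drawn from the same data set, their training and test sets overlap from one partitioning to the next, which matters when the results are compared statistically.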

Fig. 9. The distribution of the test Az values for the linear
classifier with stepwise feature selection, BPN classifier with
stepwise feature selection, and BPN classifier with GA feature
selection.

The parameters of the BPN and the GA used in this subsection were as
follows. The BPN had a variable number of input nodes, four hidden
layer nodes, and a single output node. The BPN was trained for 400
iterations for each chromosome in each generation. The GA was allowed
to evolve for a total number of 75 generations. Results of the
previous subsections suggest that there is a wide range of choice for
the parameters Pinit and Pm. It appears that a reasonable choice for
Pinit is such that the average number of selected features at
generation 0 is in the range of 0.3 to 12, and a reasonable choice
for Pm is such that the average number of mutations per chromosome
per generation is less than 1.5. For this reason, these parameters of
the GA were selected as Pm=0.02 and Pinit=0.02. Since a large
probability of crossover seemed to result in the selection of more
effective features in the previous subsections, the value of Pc was
chosen as 0.9. A penalty term was applied to the fitness function
with α = 1/2000.
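
A penalized fitness of this kind might look like the sketch below. Two assumptions are made that this section does not confirm: the penalty is taken to be linear in the number of selected features, and the area under the ROC curve is estimated with a rank-based (Mann-Whitney) statistic rather than the binormal fit of the LABROC1 program. The function names are hypothetical.

```python
def rank_auc(pos_scores, neg_scores):
    """Nonparametric ROC area: fraction of (positive, negative) pairs ranked
    correctly, counting ties as one half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def penalized_fitness(az, n_selected, alpha=1.0 / 2000.0):
    """Assumed form: classification accuracy minus alpha per selected feature."""
    return az - alpha * n_selected

if __name__ == "__main__":
    mass_scores = [0.9, 0.7, 0.8, 0.4]
    normal_scores = [0.1, 0.3, 0.5, 0.2]
    az = rank_auc(mass_scores, normal_scores)
    print(az, penalized_fitness(az, n_selected=16))
```

Under this assumed form, a chromosome with 16 features loses 0.008 from its accuracy term, so the penalty only decides between chromosomes whose accuracies are already close.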

The final GA-selected pool of variables contained 16 features. After
feature selection using the GA, the performance of the BPN classifier
with the selected features was tested with 50 training and test
groups as described above. The average training and test Az values
over 50 partitionings were 0.92 and 0.90, respectively.

To evaluate our GA-based feature selection method for a BPN, we also
applied stepwise feature selection to the same data set and the 41
features described above. The entire data set was used for feature
selection. The final selected pool of variables contained 19
features. The same 50 partitionings used for the GA experiments were
used to train and test both a linear discriminant classifier and a
BPN with the stepwise-selected features. The average training and
test Az values over 50 partitionings were 0.92 and 0.89 with the
linear classifier, and 0.92 and 0.89 with the BPN classifier. The
distributions of the test Az values for the linear classifier, as
well as for the BPN classifier with features selected using stepwise
and GA-based feature selection, are shown in Fig. 9. The distribution
of the pairwise difference of the test Az of the BPN classifiers with
stepwise and GA-based feature selection methods is shown in Fig. 10.


V.  DISCUSSION 

Our goal in this paper was the development of an effective feature
selection algorithm given a large number of features extracted from
an image data set. Table IV and Figs. 9 and 10 indicate that GA
feature selection might be a viable alternative to stepwise feature
selection.

Fig. 10. The distribution of the pairwise difference of the test Az
values of the BPN classifiers with GA and stepwise feature selection.

The average number of features selected by stepwise and GA-based
feature selection methods for a linear discriminant classifier were
19.3 and 20.1, respectively, in Table IV. In the same table, we
compared these methods to random feature selection with the number of
selected features equal to 20. Both methods performed better than
random feature selection. The difference between the average Az
obtained by GA-based feature selection and random feature selection
was more than two times the standard deviation of each Az
distribution.
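
The averages quoted from Table IV can be re-derived from the per-partition entries. The sketch below simply transcribes the ten rows of Table IV (rounded to two decimals as printed) and recomputes the column means; the variable names are illustrative only.

```python
# Per-partition values transcribed from Table IV (test groups 1 to 10).
stepwise_az = [0.87, 0.91, 0.92, 0.88, 0.86, 0.92, 0.92, 0.84, 0.86, 0.88]
stepwise_n = [19, 15, 25, 22, 23, 19, 15, 21, 14, 20]
ga_az = [0.90, 0.89, 0.93, 0.88, 0.84, 0.93, 0.91, 0.88, 0.88, 0.92]
ga_n = [20, 24, 24, 20, 23, 20, 17, 19, 18, 16]
random_az = [0.80, 0.86, 0.86, 0.81, 0.78, 0.83, 0.87, 0.75, 0.77, 0.82]

def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

if __name__ == "__main__":
    print("stepwise: Az %.3f, features %.1f" % (mean(stepwise_az), mean(stepwise_n)))
    print("GA:       Az %.3f, features %.1f" % (mean(ga_az), mean(ga_n)))
    print("random:   Az %.3f" % mean(random_az))
    print("GA minus random: %.3f" % (mean(ga_az) - mean(random_az)))
```

The recomputed means agree with the Average row of Table IV to within the rounding of the tabulated entries.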

We observed that each time the GA was trained with a different
training set, a different set of features was selected. This was also
true for stepwise feature selection. The basic reason for this was
the limited size of the data set. If training sets that could
represent the entire population were available, the selected set of
features could be expected to be more consistent among different
training sets. With the limited data set used in this study, each
time a set of cases was left out as the test data, the statistical
characteristics of the training feature set changed. Furthermore,
many of the features were highly correlated, with correlation
coefficients close to 1 or -1. Therefore, these correlated features
could be interchanged. Only ten features were selected three or more
times for the experiments in Table IV. Out of these ten features, six
were texture and four were morphological features. This indicates
that morphological and texture features are both important for the
classification of the ROIs.

The high correlation between the features in the feature space used
in this study is probably a cause of the surprisingly good
classification result (Az = 0.82) obtained with the randomly selected
features. This may also indicate that many of the features in the
feature space are very effective for this


Table IV. Test Az values of a linear discriminant classifier using
stepwise LDA, GA-based feature selection, and 20 randomly selected
features.

              Stepwise LDA              GA                        Random
Test group    Az     Num. of features   Az     Num. of features   Az
1             0.87   19                 0.90   20                 0.80
2             0.91   15                 0.89   24                 0.86
3             0.92   25                 0.93   24                 0.86
4             0.88   22                 0.88   20                 0.81
5             0.86   23                 0.84   23                 0.78
6             0.92   19                 0.93   20                 0.83
7             0.92   15                 0.91   17                 0.87
8             0.84   21                 0.88   19                 0.75
9             0.86   14                 0.88   18                 0.77
10            0.88   20                 0.92   16                 0.82
Average       0.89   19.3               0.90   20.1               0.82
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classification  task.  Therefore,  even  when  only  20  features  are 
randomly  drawn,  we  have  a  high  probability  of  drawing  ef¬ 
fective  features  and  obtaining  a  classification  result  that  is 
much higher than that which would be obtained by chance.

Our results indicate that the classification results with GA-based
feature selection are better than their counterparts with stepwise
feature selection. This is most easily seen from Fig. 9, which
compares the distribution of the Az values for a BPN classifier with
GA-based feature selection to that with stepwise feature selection.
It can be observed that the two distributions are shifted with
respect to each other, with the distribution using GA-based feature
selection exhibiting higher Az values. However, we could not perform
a paired t-test to evaluate the statistical significance of the
differences for the results listed in Table IV or those shown in
Fig. 9. The paired t-test requires independence among the samples
whereas our test (or training) sets in the different partitionings
overlapped with each other. We have used the CLABROC program to test
the statistical significance of the difference between the
corresponding pair of ROC curves for each partitioning. The
difference did not achieve statistical significance for the
individual pairs because the number of cases in each partitioned data
set is small and thus the standard deviation of Az is large (0.02 to
0.04). However, it should be noted that the improvement in Az with
GA-based feature selection, although small, is consistently observed
over the different partitionings of the data set, over both the
linear discriminant classifier (Table IV) and the BPN classifier
(Figs. 9 and 10), as well as over different data sets. The small
improvement in Az may be attributed to two causes: (1) For the linear
discriminant classifier, the stepwise feature selection procedure is
already near optimal. It is actually somewhat unexpected that the
GA-based feature selection can still provide an observable
improvement in Az. (2) It is well known that BPN performance may not
reach the global maximum if there are insufficient training samples.
For the BPN classifier in this study, the number of weights to be
trained was large compared with the number of input training samples.
Therefore, it probably did not reach its optimum when it was used in
a GA for feature selection. Again, a consistent improvement in Az
demonstrates that the GA can select more effective features for BPN
classifiers.

The main advantage of GA-based feature selection is its flexibility.
GA-based feature selection can be applied to any classifier and the
fitness function can be tailored to select features with specific
characteristics. An example of the former application is to select
features for a nonlinear classifier such as a BPN as discussed above.
An example of the latter application is to select features for
development of a highly sensitive classifier, described next.

In both breast cancer detection and classification, the cost of
missing a malignant lesion is very high. For this reason, an
important measure of classification accuracy is the FPF at high
true-positive classification. Since the design of the fitness
function of a GA is very flexible, one can aim to maximize the
partial area above a specified TPF in order to optimize the
classifier performance in this region. In a preliminary study with
our data set, we designed a GA-based feature selection algorithm in
which the fitness of a chromosome was defined as the partial area
above a TPF of 0.95. We then compared the FPF at TPFs of 100% and 96%
using GA-based and stepwise feature selection for a linear
discriminant classifier. At a TPF of 100%, the average FPFs over the
ten partitionings used in this study were 0.44 for GA-based feature
selection, and 0.68 for stepwise feature selection. At a TPF of 96%,
the average FPFs were 0.33 for GA-based feature selection, and 0.38
for stepwise feature selection. These encouraging results demonstrate
the potential of a GA-based approach to designing classifiers for a
wide range of practical problems, which cannot be achieved with a
conventional method such as stepwise discriminant analysis.
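
One way to compute a fitness based on the partial area above a specified TPF is sketched below. It works on an empirical, piecewise-linear ROC curve; the preliminary study cited above may have used a fitted curve instead, so this is an assumption, and the function names are hypothetical. Dividing by (1 - TPF0) gives an index between 0 and 1.

```python
def empirical_roc(pos_scores, neg_scores):
    """ROC points (FPF, TPF) from (0, 0) to (1, 1) for the rule score >= t."""
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpf = sum(s >= t for s in pos_scores) / len(pos_scores)
        fpf = sum(s >= t for s in neg_scores) / len(neg_scores)
        points.append((fpf, tpf))
    return points

def partial_area_above(points, tpf0):
    """Area between the piecewise-linear ROC curve and the line TPF = tpf0."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x1 == x0 or y1 <= tpf0:
            continue  # vertical segment, or segment not above the line
        if y0 >= tpf0:
            area += (x1 - x0) * ((y0 + y1) / 2.0 - tpf0)
        else:
            # The segment crosses TPF = tpf0; keep only the part above it.
            x_cross = x0 + (tpf0 - y0) / (y1 - y0) * (x1 - x0)
            area += (x1 - x_cross) * (y1 - tpf0) / 2.0
    return area

def partial_area_index(points, tpf0):
    """Partial area normalized by its maximum possible value, 1 - tpf0."""
    return partial_area_above(points, tpf0) / (1.0 - tpf0)

if __name__ == "__main__":
    roc = empirical_roc([0.9, 0.8, 0.6, 0.3], [0.1, 0.2, 0.4, 0.5])
    print(partial_area_index(roc, tpf0=0.95))
```

A classifier that reaches high TPF only at large FPF scores poorly on this index even when its total ROC area is high, which is why it can steer the GA toward highly sensitive classifiers.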

Stepwise feature selection is computationally faster than GA-based
feature selection. For example, in the present study, the stepwise
feature selection required 64-s CPU time for each partition
(Table IV) on a 90-MHz Pentium-based personal computer. The GA-based
feature selection required 519-s CPU time for each partition
(Table IV) on a 133-MHz Alpha-based workstation, when the evolution
involved a total of 250 chromosomes. However, a GA is highly
parallelizable. In principle, the fitness of each chromosome can be
evaluated on a different processor and the computation time can be
improved up to a factor equal to the number of chromosomes. The
choice between GA-based and stepwise feature selection will depend on
the application. For a linear discriminant classifier, the stepwise
feature selection may be near optimal so that the advantage of using
a GA may be small. However, for other classifiers, a GA may be more
effective because the selected feature set will be optimized to the
specific classifier used.

A GA was previously used for the task of feature selection in a
classification problem with 30 features and 150 cases. The GA fitness
criterion in this application was designed to be a function of the
correct classification rate with a nearest-neighbor classifier. After
the features were selected, a neural network was employed for final
classification. Our approach has two advantages over this
application. First, we used a more sophisticated classifier in the
fitness function computation stage, hence GA training is more
efficient. Second, we used the same classifier at the final
classification stage, therefore our results are expected to be more
consistent. Our results are also expected to be less biased since we
divided our data set into independent training and test groups for GA
evaluation, whereas the entire data set was used for training in the
other study.

VI.  CONCLUSION 

We investigated the use of a GA for feature selection, and
demonstrated its application by classifying ROIs on mammograms as
either containing mass or normal tissue. By comparing stepwise
feature selection and GA-based feature selection for two different
classifiers (the linear discriminant classifier and the BPN), and by
examining the problem of designing classifiers biased to have high
sensitivity performance, we have demonstrated the versatility offered
by a GA
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in  the  design  of  classifiers  for  a  variety  of  classification  tasks 
without  a  trade-off  in  the  effectiveness  of  the  selected  fea¬ 
tures.  Future  work  in  this  area  includes  application  of  GA- 
based  feature  selection  to  different  classification  tasks  such 
as  differentiation  of  malignant  and  benign  tissue,  and  a  de¬ 
tailed  investigation  of  the  formulation  of  different  fitness 
measures,  such  as  the  partial  area  at  the  high-TPF  region  of 
the  ROC  curve,  for  the  design  of  classifiers  in  different  ap¬ 
plications. 
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This  paper  presents  segmentation  and  classification  results  of  an  automated  algorithm  for  the  de¬ 
tection  of  breast  masses  on  digitized  mammograms.  Potential  mass  regions  were  first  identified 
using  density-weighted  contrast  enhancement  (DWCE)  segmentation  applied  to  single-view  mam¬ 
mograms.  Once  the  potential  mass  regions  had  been  identified,  multiresolution  texture  features 
extracted  from  wavelet  coefficients  were  calculated,  and  linear  discriminant  analysis  (LDA)  was 
used  to  classify  the  regions  as  breast  masses  or  normal  tissue.  In  this  article  the  overall  detection 
results  for  two  independent  sets  of  84  mammograms  used  alternately  for  training  and  test  were 
evaluated  by  free-response  receiver  operating  characteristics  (FROC)  analysis.  The  test  results 
indicate that this new algorithm produced approximately 4.4 false
positives per image at a true positive detection rate of 90% and 2.3
false positives per image at a true positive rate of 80%.
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1.  INTRODUCTION 

Breast cancer is the most common malignancy affecting women and is
second only to lung cancer in tumor-related deaths in females. It was
estimated that 182 000 new cases of breast cancer would occur in
American women and 42 000 women would die from the disease in 1994.
This comprises 32% of all new cases of cancer and 18% of cancer
deaths in women. Efforts to decrease the mortality are currently
aimed at early diagnosis and complete removal of small nonmetastatic
lesions. In an attempt to reduce cost and increase effectiveness,
investigators are developing new techniques to improve detection of
early breast cancers. Computer-aided diagnosis (CAD) is one technique
that may achieve both goals of lowering cost and increasing
effectiveness. CAD is especially well suited for the digital imaging
technology which is being developed to produce digital images in full
view mammography.

Several research groups have developed computer algorithms for
automated detection of mammographic masses. Kegelmeyer has reported
promising results for detecting spiculated lesions based on local
edge characteristics and Laws texture features. Both Lai et al. and
Qian et al. proposed different variations of median filtering to
enhance the digitized image prior to object identification. A
thresholding method for mass localization and a mass classification
algorithm using fuzzy pyramid linking have been developed by
Brzakovic et al. Other investigators have proposed using the
asymmetry between the right and left breast images to determine
possible mass locations. Yin et al. use both linear and nonlinear
bilateral subtraction while the method by Lau et al. relies on
"structural asymmetry" between the two breast images. The above
methods produced between one and five false detections for a true
positive detection rate of approximately 90%. However, it is
difficult to compare the effectiveness of these methods because each
used a unique set of digitized mammograms, and the results varied
between training and test. A general comparison between algorithms is
further complicated by the fact that most of these studies were
conducted using small data sets. While initial results from the first
large scale preclinical study have been encouraging, the performance
of detection programs with clinical samples may not match their
performance in laboratory tests.

Our preliminary study introduced the density-weighted contrast
enhancement (DWCE) segmentation method and found that it was capable
of detecting breast masses on 25 digitized mammograms. In this
article, a set of 168 digitized mammograms is used to evaluate a
modified version of the original DWCE segmentation method in
combination with a texture classification scheme. The following
procedure was used to evaluate this new detection scheme. The set of
digitized mammograms was first segmented into potential breast masses
using the DWCE segmentation. This method employed an adaptive filter
to enhance structures within the breast region of a mammogram and
then identified the structures using a simple edge detection
algorithm. Once the digitized images were segmented using the DWCE,
regions of interest (ROIs) based on the detected breast structures
were extracted from each mammogram, and a set of multiresolution
texture features was calculated for each extracted ROI. The feature
set was then used by a linear discriminant analysis (LDA) algorithm
to reduce the number of false detections. Finally, the performance of
the DWCE segmentation and ROI texture classification scheme was
evaluated using
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^  tree-response  receiver  operating  characteristics  (FROC) 
analysis. 

II.  MATERIALS  AND  METHODS 
A.  Database 

The clinical mammograms used in this study were acquired with
American College of Radiology accredited mammography systems. Kodak
MinR/MRE screen/film systems with extended cycle processing were used
as the image recorder. The mammography systems have a 0.3-mm focal
spot, a molybdenum anode, 0.03-mm-thick molybdenum filter, and a 5:1
reciprocating grid. The mammograms were selected from the files of
patients who had undergone biopsy at the University of Michigan in
the last five years. The selection criterion used by the radiologists
was simply that a biopsy-proven mass existed on the mammogram. This
set excluded lesions visible only by architectural distortions (i.e.,
no defined mass) but included masses accompanied by calcifications.
No attempt was made to match the number of malignant and benign mass
cases, but we did try to include a cross section of malignant masses.
This led to a much larger proportion of malignant lesions than that
in the general screening population. To avoid the effect of the
repetitive grid pattern on the texture feature calculations, all
mammograms with visible grid lines were excluded from the original
set. Our final data set for this preliminary study was composed of
168 single-view mammograms. It included 85 malignant and 83 benign
masses. The size of the masses ranged from 5 mm to 26 mm with a mean
size of 12.2 mm, and their visibility ranged from 1 (obvious) to 10
(subtle) with a mean visibility of 4.51. A more complete discussion
of the images selected for this study can be found in Wei et al.

The mammograms were digitized with a LUMISYS DIS-1000 laser film
scanner with a pixel size of 100 μm and 4096 gray levels. The
digitizer logarithmically amplifies the light transmitted through the
mammographic film before analog-to-digital conversion so that the
gray levels are linearly proportional to optical densities in the
range of 0.1 to 2.8 optical density units (O.D.). The O.D. range of
the scanner is 0-3.5 with large pixel values in the digitized
mammograms corresponding to low O.D. The digitized images used in
this study were approximately 2000X2000 pixels in size. To conserve
processing time and reduce noise in the initial DWCE segmentation
stages, the full resolution mammograms were first smoothed with an
8X8 box filter and subsampled by a factor of 8, resulting in 800-μm
images of approximately 256X256 pixels in size. However, the texture
features used in the final LDA classification were calculated from
the original images with a 100-μm pixel size.
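
The resolution reduction described above can be sketched as a block average. The sketch assumes that the 8X8 box filter and the factor-of-8 subsampling are aligned, so that each output pixel is the mean of one non-overlapping 8X8 block; the alignment actually used is not stated here, and the function name is hypothetical.

```python
def box_smooth_and_subsample(image, factor=8):
    """Average non-overlapping factor x factor blocks of a 2-D image.

    `image` is a list of rows; rows and columns that do not fill a complete
    block at the bottom and right edges are discarded.
    """
    out_rows = len(image) // factor
    out_cols = len(image[0]) // factor
    reduced = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            total = 0.0
            for dy in range(factor):
                for dx in range(factor):
                    total += image[r * factor + dy][c * factor + dx]
            row.append(total / (factor * factor))
        reduced.append(row)
    return reduced

if __name__ == "__main__":
    # Small synthetic image whose gray level equals the row index.
    image = [[float(y)] * 16 for y in range(16)]
    print(box_smooth_and_subsample(image))
```

With a 100-μm input pixel and a factor of 8, each output pixel covers 800 μm, in agreement with the resolution quoted above.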

The location and extent of all the biopsy-proven masses were marked
on the original films by a radiologist. They were then localized on
the digitized images and stored in a "truth" file on the computer by
defining both the centroid (approximate center) of the lesion and the
smallest bounding box (rectangle) containing the entire lesion. Both
of these procedures were performed by hand using the original marked
film as a guide. The centroid "truth" was used to analyze the initial
DWCE segmentation. If an object segmented by the DWCE contained the
centroid of the mass within the object region, it was considered a
true positive (TP); otherwise, it was considered a false positive
(FP). The centroid provided a fast method for evaluating the DWCE
segmentation in its global and local stages. However, the final
texture classification results are based on the more precise bounding
box "truth." A region was considered a TP detection when at least 50%
of the "truth" bounding box was detected. The centroid and bounding
box definitions for the mass provided both an efficient mechanism for
development of the DWCE and an accurate final analysis for the
overall detection scheme.
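
The two scoring rules can be written compactly. The sketch below makes several assumptions for illustration: a segmented object is represented as a set of pixel coordinates for the centroid rule, the detected region is approximated by a rectangle for the bounding-box rule, boxes are given as half-open (x0, y0, x1, y1) tuples, and all function names are hypothetical.

```python
def contains_centroid(object_pixels, centroid):
    """Centroid rule: TP if the mass centroid lies inside the object region."""
    return centroid in object_pixels

def box_area(box):
    """Area of a half-open box (x0, y0, x1, y1); zero if the box is empty."""
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)

def fraction_of_truth_detected(truth_box, detected_box):
    """Fraction of the truth bounding box covered by the detected box."""
    ix0 = max(truth_box[0], detected_box[0])
    iy0 = max(truth_box[1], detected_box[1])
    ix1 = min(truth_box[2], detected_box[2])
    iy1 = min(truth_box[3], detected_box[3])
    return box_area((ix0, iy0, ix1, iy1)) / box_area(truth_box)

def is_true_positive(truth_box, detected_box, min_fraction=0.5):
    """Bounding-box rule: TP if at least min_fraction of the truth is covered."""
    return fraction_of_truth_detected(truth_box, detected_box) >= min_fraction

if __name__ == "__main__":
    truth = (0, 0, 10, 10)
    print(is_true_positive(truth, (5, 0, 20, 10)))  # half of the truth box
    print(is_true_positive(truth, (6, 0, 20, 10)))  # less than half
    print(contains_centroid({(4, 5), (5, 5), (6, 5)}, (5, 5)))
```

The centroid rule is cheaper to evaluate, which suits the development stages, while the overlap rule rewards detections that cover the lesion rather than merely touch its center.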

For evaluation of the DWCE segmentation and subsequent texture
classification, the 168 single-view mammograms were randomly divided
into two groups of 84 images, groups G1 and G2, with the constraint
that all images from a single patient were kept in the same group. A
single set of DWCE segmentation parameters was applied to all images
(G1 and G2) to extract potential mass regions. The regions extracted
from the G1 and G2 images were then alternately used as training and
test sets in the texture classification as described below.

B.  Density-weighted  contrast  enhancement 
segmentation 

Edge detection applied to an unenhanced image was not effective in
detecting breast masses because of the low signal-to-noise ratio of
the edges and the presence of complicated structured background. To
overcome these problems, we have developed a new algorithm using DWCE
filtering along with Laplacian-Gaussian (LG) edge detection for
automatic segmentation of low contrast structures in digital
mammograms. The DWCE segmentation method employed adaptive filtering,
edge detection, and morphological FP reduction to detect potential
breast masses in a two-stage approach. In the first stage, DWCE
segmentation was applied globally to the entire breast region of the
mammogram to identify ROIs. In the second stage, the segmentation was
applied locally to the ROIs identified in the global stage. Figures
1(a) and 1(b) depict the block diagrams for the global and local
stages of this algorithm. The DWCE segmentation was originally
introduced by Petrick et al. but has been slightly modified in this
study to improve its overall performance. In the following
subsections we will summarize the main components of both the global
and local stages, and highlight the differences between the original
and current implementations of the DWCE technique.

1.  Global  stage:  Density-weighted  contrast 
enhancement  filtering 

The DWCE filter was developed to accentuate mammographic structures
before edge detection by adaptively enhancing local contrast and is
an extension of the local contrast and mean adaptive filter proposed
by Peli and Lim. The block diagram of the filter is shown in Fig. 2,
while Fig. 3 contains examples of the images produced by each filter
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F'G.  1.  The  block  diagram  of  [he  two-stage  DWCE  segmentation  method 
used for the initial breast mass detection. The block diagram for the
global stage is depicted in (a) while the local stage is shown in
(b). Note, the outputs from the global stage are individually
processed in the local stage.


block  for  a  typical  mammogram  from  our  image  set.  All  the 
DWCE  functions  introduced  in  the  following  discussion  cor¬ 
respond  to  the  steps  illustrated  in  Fig.  2. 

DWCE filtering was applied to the breast region [i.e., the breast map
$F_{map}(x,y)$] of each mammogram, which had been identified using
thresholding and edge detection. Figure 3(a) shows a typical mammogram,
$F(x,y)$, at 800-μm resolution while Fig. 3(b) shows its breast map. The pixel
intensities from $F(x,y)$ within the breast map were next rescaled to be
between 0.0 and 1.0, producing a normalized breast image, $F_N(x,y)$. This
normalization reduced the gray-level variation due to breast tissue
composition and the imaging technique so that a single set of filter
parameters could be applied uniformly to all digitized mammograms.
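
The normalization step can be illustrated with a short sketch. The text only states that intensities inside the breast map were rescaled to lie between 0.0 and 1.0; the linear min-max rescaling, the function name `normalize_breast_region`, and the choice to set pixels outside the breast map to 0.0 are assumptions made here for illustration.

```python
def normalize_breast_region(image, breast_map):
    """Rescale pixels inside the breast map to the range [0.0, 1.0].

    image      -- list of rows of gray levels, F(x, y)
    breast_map -- list of rows of truthy/falsy flags, same shape as image
    Linear min-max scaling is an assumption; the source only gives the range.
    """
    inside = [v for row, flags in zip(image, breast_map)
              for v, flag in zip(row, flags) if flag]
    lo, hi = min(inside), max(inside)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant region
    return [[(v - lo) / span if flag else 0.0   # outside pixels: assumed 0.0
             for v, flag in zip(row, flags)]
            for row, flags in zip(image, breast_map)]


if __name__ == "__main__":
    f = [[10, 20], [30, 999]]
    m = [[1, 1], [1, 0]]
    print(normalize_breast_region(f, m))
```

Restricting the minimum and maximum to pixels inside the breast map keeps the film background from compressing the useful gray-level range.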

The normalized image was next split into a density and a contrast image,
$F_D(x,y)$ and $F_C(x,y)$, respectively. $F_D(x,y)$ was produced by low-pass
filtering the normalized input image using $G\{0,\sigma_D\}$, a Gaussian
filter with zero mean and standard deviation $\sigma_D = 8.0$. Likewise,
$F_C(x,y)$ was produced by bandpass or high-pass filtering the normalized
image. In the current DWCE implementation, $F_C(x,y)$ is created by
subtracting the density image from the normalized input,

    $F_C(x,y) = F_N(x,y) - F_D(x,y)$,  (1)

or

    $F_C(x,y) = F_N(x,y) - G\{0,\sigma_D\} * F_N(x,y)$,  (2)

where $*$ represents two-dimensional convolution. Figures 3(c) and 3(d) show
the density and contrast images obtained using this procedure.

Fig. 3. (a) A typical mammogram from our image database; (b) the
corresponding breast map used in the DWCE segmentation; (c) the density image
($F_D(x,y)$); (d) the contrast image ($F_C(x,y)$); (e) the weighted-contrast
image ($F_{KC}(x,y)$); (f) the rescaled weighted-contrast image ($F_E(x,y)$);
(g) the detected structures remaining after the global FP reduction step; (h)
the detected structures remaining after the final splitting FP reduction step.

The local density value, $F_D(x,y)$, was then used to determine a
multiplication factor, $K_M(F_D(x,y))$, for each pixel $(x,y)$ in the image.
The multiplication factor was used to either enhance or suppress the local
contrast and thereby produced a new weighted contrast image:

    $F_{KC}(x,y) = K_M(F_D(x,y)) \times F_C(x,y)$.  (3)

This process allowed the DWCE filter to adapt to local background
characteristics within the image and was the principal component for our
adaptive signal-to-noise ratio (SNR) enhancement. In this case, the signal
refers to breast masses or other predominant structures within the breast.
The output of the DWCE filter was given as

    $F_E(x,y) = K_{NL}(F_{KC}(x,y)) \times F_{KC}(x,y)$,  (4)

where each pixel, $(x,y)$, in the weighted contrast image was used to define a
second multiplication factor, $K_{NL}(F_{KC}(x,y))$, that nonlinearly scaled
the weighted contrast image. This nonlinear scaling was used to further
suppress the background and to separate merged structures in the DWCE
enhanced image. Figures 3(e) and 3(f) show the weighted contrast and scaled
weighted contrast images, respectively, obtained with the DWCE technique.
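
The filter chain of Eqs. (1), (3), and (4) can be sketched as follows. This is a minimal pure-Python illustration, not the implementation used in the study: the curves `example_k_m` and `example_k_nl` are hypothetical stand-ins for the experimentally determined functions plotted in Fig. 4, and the kernel truncation at three standard deviations and the border replication are assumptions.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian, truncated at about 3 sigma (assumption)."""
    radius = int(3 * sigma + 0.5)
    k = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    total = sum(k)
    return [v / total for v in k]


def smooth_1d(values, kernel):
    """Convolve one row with the kernel, replicating border values."""
    r, n = len(kernel) // 2, len(values)
    return [sum(kernel[j + r] * values[min(max(i + j, 0), n - 1)]
                for j in range(-r, r + 1)) for i in range(n)]


def gaussian_blur(image, sigma):
    """Separable Gaussian low-pass filter: rows first, then columns."""
    k = gaussian_kernel(sigma)
    rows = [smooth_1d(row, k) for row in image]
    cols = [smooth_1d(list(col), k) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def example_k_m(z):
    """Hypothetical density weighting: suppress low density, enhance the rest."""
    return 0.5 if z < 0.25 else 1.5


def example_k_nl(v):
    """Hypothetical nonlinear scaling: strongly damp very low contrast."""
    return min(1.0, abs(v) / 0.1)


def dwce(f_n, sigma_d, k_m=example_k_m, k_nl=example_k_nl):
    """Apply Eqs. (1), (3), (4) to a normalized image f_n (list of rows)."""
    f_d = gaussian_blur(f_n, sigma_d)             # density image
    out = []
    for row_n, row_d in zip(f_n, f_d):
        out_row = []
        for n, d in zip(row_n, row_d):
            f_c = n - d                           # Eq. (1): contrast image
            f_kc = k_m(d) * f_c                   # Eq. (3): weighted contrast
            out_row.append(k_nl(f_kc) * f_kc)     # Eq. (4): filter output
        out.append(out_row)
    return out
```

Passing the two multiplication functions as arguments mirrors the statement that they define the enhancement properties of the filter and can be tailored to a specific task.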

It can be seen that the two multiplication functions, $K_M$ and $K_{NL}$,
define the enhancement properties of the filter. These functions can be
tailored to suit a specific task. Figures 4(a) and 4(b) show the curves
selected for $K_M$ and $K_{NL}$, respectively, in the current filter. The
shape of the density-weighted contrast function, $K_M$, was selected to
accentuate the contrast at pixels, $z = F_D(x,y)$, in the density image with
medium to high intensity, while deemphasizing the contrast at pixels with low
intensity. Thus, this function suppressed small structures mainly surrounded
by background tissue and enhanced larger structures which are more likely to
be masses. The exact shape of the multiplication function was determined
experimentally by observing how detection was affected by variations in
$K_M$. We chose $\{K_M(z) \geq 1.0 : 0.25 \leq z \leq 1.0\}$ in the current
weighted contrast function so that 75% of the intensity range was enhanced.
$K_M$ was found to be effective in reducing the background and enhancing
breast structures, but it did not provide adequate separation between the
structures. The shape of the nonlinear scaling function, $K_{NL}$, was
selected to provide additional separation between objects. Very low contrast
regions were strongly deemphasized, thus eliminating many low-intensity
bridges between individual structures. It was also found that a slight
suppression of the highest contrast intensities provided a more uniform
intensity distribution across detected breast structures. Again, the specific
shape of the nonlinear contrast scaling was determined experimentally by
observing the effect of different functional forms on the detection and
object separation. A complete discussion of the DWCE multiplication functions
used in this study can be found in the literature.

Fig. 4. Plots of (a) the weighted-contrast multiplication function
($K_M(z)$) and (b) the nonlinear rescaling function ($K_{NL}(z)$) used in the
DWCE filtering.

2. Global stage: Object edge detection

The DWCE filtering was applied to the original mammogram to facilitate the
detection of structures within the image and thus provided an estimate of
their physical extent. The DWCE implementation provided significant
background reduction, as shown in Fig. 3(f), allowing for the use of a less
complex edge detector. In this study, object edges were identified from the
DWCE filtered mammogram using a Laplacian-Gaussian (LG) edge detector [Block 2
in Fig. 1(a)]. Edges in the enhanced image, $F_E(x,y)$, were defined as the
zero crossing locations of

    $\nabla^2 G\{0,\sigma_E\} * F_E(x,y)$,  (5)

where $G\{0,\sigma_E\}$ was a zero mean Gaussian smoothing function with
standard deviation $\sigma_E = 2.0$. The advantages of this edge detector are
that its performance is independent of edge direction and that it tends to
produce closed regions.

After the edge detection, all enclosed structures were filled to eliminate
any holes that may have formed inside individual objects. The edges from each
of the filled objects were tracked and identified. This edge detection is
identical to the original DWCE implementation described in the literature.


3.  Global  stage:  False  positive  reduction 

The DWCE filtering and subsequent edge detection do not differentiate between
mass and normal tissues; therefore, a large number of potential regions were
usually found. Since the shapes of breast masses in general are different
from those of normal tissue, we extracted morphological features and used a
classification algorithm to identify some of these differences [Block 3 in
Fig. 1(a)]. The goal here was to reduce the number of FP regions without
losing a significant number of true masses, thus allowing the maximum number
of TP regions to be passed on to the local processing stage. In this study,
six additional morphological features were combined with the original set of
five features used in the previous study to improve the differentiation
between mass and normal tissue objects. The original features were the number
of edge pixels ($P$), the total object area ($A = \mathrm{area}(F_{obj})$),
the object's contrast, circularity, and rectangularity. The new features
added in this implementation were the perimeter-to-area ratio (PAR) and a set
of five normalized radial length (NRL) features. To define circularity and
rectangularity, the minimum sized bounding box completely containing the
object, $F_{bb}(x,y)$, and a circle with an area equivalent to the object
area, $F_{eq}(x,y)$, were calculated. $F_{eq}(x,y)$ was centered at the
object's centroid location and had radius $r_{eq}$ given by

    $r_{eq} = \sqrt{A/\pi}$.  (6)

Circularity and rectangularity were then defined as

    $\mathrm{Circularity} = \dfrac{\mathrm{area}(F_{obj} \cap F_{eq})}{\mathrm{area}(F_{obj})}$,  (7)

    $\mathrm{Rectangularity} = \dfrac{\mathrm{area}(F_{obj})}{\mathrm{area}(F_{bb})}$.  (8)
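
As a minimal sketch of Eqs. (6) to (8), the two shape measures can be computed from the set of pixels belonging to one object. The function name `shape_features`, the use of an axis-aligned bounding box, and the test of pixel centers against the equivalent circle are assumptions made for this illustration.

```python
import math


def shape_features(pixels):
    """Return (circularity, rectangularity) for one detected object.

    pixels -- set of (x, y) integer coordinates belonging to the object.
    The bounding box is taken to be axis-aligned (an assumption).
    """
    area = len(pixels)
    cx = sum(x for x, _ in pixels) / area
    cy = sum(y for _, y in pixels) / area
    r_eq_sq = area / math.pi                      # Eq. (6), squared radius
    # Eq. (7): object pixels whose centers fall inside the equivalent circle
    inside = sum(1 for x, y in pixels
                 if (x - cx) ** 2 + (y - cy) ** 2 <= r_eq_sq)
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    bb_area = (max(xs) - min(xs) + 1) * (max(ys) - min(ys) + 1)
    return inside / area, area / bb_area          # Eqs. (7) and (8)


if __name__ == "__main__":
    line = {(x, 0) for x in range(8)}             # elongated object
    print(shape_features(line))
```

An elongated object fills its bounding box (rectangularity near 1) but overlaps its equivalent circle poorly (low circularity), which is the kind of difference the classifier stage exploits.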

The five NRL features were a subset of the features defined by Kilday et al.
A radial length function was defined as the Euclidean distance from an
object's centroid to each of its edge pixels and normalized relative to the
maximum radial length for the object. This created an NRL vector given as

    $R = \{r_k : 0 \leq k \leq N_E - 1\}$,  (9)

where $N_E$ was the number of edge pixels in the object. The histogram of the
radial length was also calculated and created the probability vector

    $P = \{p_k : 0 \leq k \leq N_H - 1\}$,  (10)

where $N_H$ was the number of bins used in the histogram. The NRL features
selected in this study were the NRL mean value, standard deviation, entropy,
area ratio, and zero crossing count. They are defined as

    $\mu_{NRL} = \dfrac{1}{N_E} \sum_{k=0}^{N_E-1} r_k$,  (11)

    $\sigma_{NRL} = \sqrt{\dfrac{1}{N_E} \sum_{k=0}^{N_E-1} (r_k - \mu_{NRL})^2}$,  (12)

    $E_{NRL} = -\sum_{k=0}^{N_H-1} p_k \log(p_k)$,  (13)

    $AR_{NRL} = \dfrac{1}{\mu_{NRL} N_E} \sum_{k=0}^{N_E-1} (r_k - \mu_{NRL}) : r_k > \mu_{NRL}$,  (14)

    $ZCC_{NRL}$ = number of zero crossings of $\{r_k - \mu_{NRL}\}$.  (15)

A complete description of all the NRL features used in this study can be
found in the literature.

The extracted morphological features were used in a sequential classification
scheme: a simple threshold classifier, followed by an LDA classifier, and
finally followed by a backpropagation neural network (BPN). The purpose of
each classifier was to reduce the number of FP regions with a minimum number
of TP losses. This improved reduction scheme was selected because it has been
found that sequential or parallel combinations of the different classifiers
often increased the classification accuracy over the individual classifiers.
This is probably because they extract different information from the feature
space. The threshold classifier simply set a maximum and a minimum value for
each morphological feature. This provided some initial reduction and
prevented the LDA and BPN classifiers from training with nonrepresentative
object features. If all the morphological features from a detected object
fell within the bounds, it was kept as a potential mass; otherwise, it was
considered to be normal tissue and discarded. All DWCE detected objects with
feature values within the defined limits were saved as potential mass objects
and passed on to the LDA classifier. The maximum and minimum feature limits,
$f_{i,\max}^{Thr}$ and $f_{i,\min}^{Thr}$, respectively, were identical for
both the G1 and G2 image groups and were selected as a multiple of the
individual mass object bounds:

    $f_{i,\max}^{Thr} = \{k \max_j(f_{i,j}) : j \in [\text{index of all detected mass objects}]\}$,  (16)

    $f_{i,\min}^{Thr} = \{k \min_j(f_{i,j}) : j \in [\text{index of all detected mass objects}]\}$,  (17)

where $f_{i,j}$ is the value of the $i$th feature ($i \in [1,11]$) for the
$j$th detected object. For this study, the multiplication factor ($k$) was
selected to be 1.0. The second classifier, LDA, formed a linear combination
of the morphological features and produced a single discriminant score for
each remaining potential mass object. This classification scheme will be
described in more detail in Sec. II C. The LDA classifier applied to the G1
objects was trained with the G2 object features and vice versa. This provided
independent LDA training for each of the image sets. In order to minimize the
probability of losing true masses, a lax discriminant threshold was chosen to
retain most of the masses while achieving moderate FP reduction. The reduced
sets of G1 and G2 objects with their morphological features were then passed
on to a final BPN classification step. BPN formed a nonlinear combination of
the morphological features into a single discriminant score. A complete
description of the BPN morphological classification can be found in the
literature. In this step, a three input node, four hidden node, single output
BPN architecture was utilized. The BPN classifier was trained in a similar
fashion as the LDA but only the three most uncorrelated features (area,
perimeter-to-area ratio, and contrast) were used as the input features. The
individual G1 and G2 image sets were again used to train a pair of BPN
classifiers, and the discriminant thresholds were chosen to maximize FP
reduction while minimizing the loss of masses. All remaining DWCE detected
objects after the application of the three classifiers were considered as
potential mass objects and passed on to the ROI segmentation and subsequent
local stage of the DWCE segmentation. Figure 3(g) shows the final reduced set
of objects detected by the global stage for the original mammogram of
Fig. 3(a).
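
The five NRL features of Eqs. (11) to (15) can be sketched in a few lines. This is an illustration under stated assumptions rather than the code used in the study: the edge pixels are assumed to be ordered along the closed contour, the zero crossing count wraps around from the last pixel to the first, the entropy uses the natural logarithm, and the default of 10 histogram bins is a hypothetical choice.

```python
import math


def nrl_features(edge_pixels, centroid, n_bins=10):
    """Compute the NRL mean, std, entropy, area ratio and zero crossing count.

    edge_pixels -- list of (x, y) contour points, ordered along the contour
    centroid    -- (x, y) of the object centroid
    n_bins      -- histogram bins N_H (hypothetical default)
    """
    cx, cy = centroid
    radial = [math.hypot(x - cx, y - cy) for x, y in edge_pixels]
    r_max = max(radial)
    r = [d / r_max for d in radial]                       # Eq. (9)
    n_e = len(r)

    mean = sum(r) / n_e                                   # Eq. (11)
    std = math.sqrt(sum((v - mean) ** 2 for v in r) / n_e)  # Eq. (12)

    counts = [0] * n_bins
    for v in r:
        counts[min(int(v * n_bins), n_bins - 1)] += 1     # r = 1.0 -> last bin
    probs = [c / n_e for c in counts]                     # Eq. (10)
    entropy = -sum(p * math.log(p) for p in probs if p > 0)  # Eq. (13)

    # Eq. (14): only radial lengths above the mean contribute
    area_ratio = sum(v - mean for v in r if v > mean) / (mean * n_e)

    # Eq. (15): sign changes of r_k - mean around the closed contour
    above = [v > mean for v in r]
    zcc = sum(1 for k in range(n_e) if above[k] != above[k - 1])

    return {"mean": mean, "std": std, "entropy": entropy,
            "area_ratio": area_ratio, "zcc": zcc}
```

For a contour at constant distance from the centroid, every feature except the mean is zero; lobulated or spiculated contours raise the standard deviation, area ratio, and zero crossing count.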


4.  Global  stage:  ROI  segmentation 

The final step in the global stage was the segmentation of the detected local
regions [Fig. 1(a)]. For each remaining potential mass object, a ROI
corresponding to the object's bounding box was defined on the subsampled
mammogram. The minimum size for these ROIs was chosen to be 32×32 pixels. A
bounding box of an object smaller than this size was uniformly expanded in
each direction (horizontal and vertical) until it reached 32×32 pixels. These
defined object regions were then used as input ROIs to the local DWCE stage.

5. Local stage: DWCE filtering, edge detection, and
local false positive reduction

The local stage of the DWCE segmentation was very similar to the global stage
and is again depicted in Fig. 1(b). The main difference was that the
processing was performed in local regions within the image. This local
processing allowed the DWCE filter to adapt to the intensity distribution
within each ROI and thus refined the borders of the detected objects. The
input images to this stage, $F_i(x,y)$, were defined from the detected
objects in the global stage. This local stage had five main components. Three
of the components had corresponding global stage counterparts, and they
included a second DWCE filter, LG edge detector, and local FP reduction step.
The local DWCE filter and LG edge detector used identical parameters as their
first stage counterparts, while the FP reduction step again used the 11
morphological features and the sequential thresholding, LDA, and BPN
classification discussed previously. The only difference in the local FP
reduction was that the feature and discriminant thresholds were adjusted to
reflect the morphological properties of the locally extracted structures.
Again, the goal of this FP reduction step was to reduce the number of
potential mass regions before the regions were processed by a final texture
classification stage. Therefore, lax decision thresholds were chosen to
minimize additional losses of true mass objects.
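
The rule used to define the ROIs handed from the global stage to the local stage (a bounding box smaller than 32×32 pixels is grown uniformly until it reaches that size) can be sketched as below. The function name `expand_to_min_size`, the inclusive coordinate convention, and the handling of an odd size deficit are assumptions; clipping at the image border is not described in the text and is not handled here.

```python
def expand_to_min_size(bbox, min_size=32):
    """Grow a bounding box symmetrically until each side is >= min_size.

    bbox -- (x0, y0, x1, y1) with inclusive pixel coordinates.
    When the deficit is odd, the extra pixel goes to the high side
    (an arbitrary choice for this sketch).
    """
    def grow(lo, hi):
        deficit = min_size - (hi - lo + 1)
        if deficit <= 0:
            return lo, hi                 # already large enough
        before = deficit // 2
        return lo - before, hi + (deficit - before)

    x0, y0, x1, y1 = bbox
    x0, x1 = grow(x0, x1)
    y0, y1 = grow(y0, y1)
    return x0, y0, x1, y1


if __name__ == "__main__":
    print(expand_to_min_size((100, 100, 109, 119)))
```

Each axis is treated independently, so a box that is already wide enough in one direction is only expanded in the other.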


6.  Local  stage:  Object  splitting  and  splitting  FP 
reduction 


The local processing of the mammograms led to larger objects because of the
improved estimate of the local background. However, the larger objects often
resulted in region merging (i.e., different structures within the breast
merged into a single detected region). An object splitting step was therefore
added to the local stage [Block 4 in Fig. 1(b)]. This splitting step enabled
the use of fixed sized ROIs in the final texture classification. The
splitting algorithm searched for narrowings in the cross section of an
object. The algorithm initially found the cross-section width for each column
in the object [$F_X(x)$, of length $n$]. Using $F_X(x)$, three parameters
were calculated for each $x$. They were the area ratio of the two created
objects along with the global and local cross-section width ratios. These
ratios were defined as

    $F_{Area}(x) = \dfrac{\min(A_R(x), A_L(x))}{\max(A_R(x), A_L(x))}$,  (18)

    $F_{Gbl}(x) = 1.0 - \dfrac{F_X(x)}{\max\{F_X(z) : z \in [0, n-1]\}}$,  (19)

    $F_{Lcl}(x) = 1.0 - \dfrac{F_X(x)}{\max\{F_X(z) : z \in [x-2, x+2]\}}$,  (20)

where $A_R(x)$ and $A_L(x)$ were the areas of the right and left objects
produced by splitting at location $x$. At each potential neck location, $x$,
a cut value $F_{cut}(x)$ was defined as a linear combination of the
cross-section ratios and the area ratio

    $F_{cut}(x) = 1.5 F_{Gbl}(x) + 2.0 F_{Lcl}(x) + 1.0 F_{Area}(x)$.  (21)

After similar cut functions were computed for each row and for the 45° and
135° directions, a maximum cut value was found for the object and compared to
a cut threshold. If this maximum cut value exceeded this threshold, the
object was split at that point; otherwise, it was left unchanged. If the
object was split, the same algorithm was applied to the newly formed objects
until no further splitting occurred. The splitting algorithm incorporated
area information into the splitting process, thereby giving preference to
narrowings closer to the center of the object and minimizing the number of
times an object was split. For a complete description of this splitting
algorithm refer to Petrick et al.
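
The column-wise cut function of Eqs. (18) to (21) can be sketched as follows. Only the column direction is shown; the row and diagonal directions would follow the same pattern. The function name `cut_values` and the convention that the column at the cut is assigned to the right-hand object are assumptions, and the cut threshold itself is not given in the text, so it is left out.

```python
def cut_values(widths):
    """Return F_cut(x) for every column x of one object.

    widths -- list of cross-section widths F_X(x), one per column, all > 0.
    The pixels of column x are counted with the right-hand object
    (an assumption; the source does not state the convention).
    """
    n = len(widths)
    total = sum(widths)
    global_max = max(widths)
    cuts = []
    for x in range(n):
        left = sum(widths[:x])                    # A_L(x)
        right = total - left                      # A_R(x)
        f_area = min(left, right) / max(left, right)          # Eq. (18)
        f_gbl = 1.0 - widths[x] / global_max                   # Eq. (19)
        local_max = max(widths[max(0, x - 2):min(n, x + 3)])
        f_lcl = 1.0 - widths[x] / local_max                    # Eq. (20)
        cuts.append(1.5 * f_gbl + 2.0 * f_lcl + 1.0 * f_area)  # Eq. (21)
    return cuts


if __name__ == "__main__":
    dumbbell = [4, 4, 4, 1, 4, 4, 4]              # two blobs joined by a neck
    values = cut_values(dumbbell)
    print(values.index(max(values)))              # column of the best cut
```

For the dumbbell-shaped profile, the narrow central column scores highest on all three terms, which illustrates how the area term favors necks near the middle of an object.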

The final FP reduction [Block 5 in Fig. 1(b)] again employed the 11
morphological features and the sequential classification scheme described in
Sec. II B 3. The feature and discriminant thresholds were adjusted to reflect
the morphological properties of the split objects. Figure 3(h) shows the set
of detected objects after the complete two-stage DWCE segmentation for the
original mammogram of Fig. 3(a).

C.  Texture  classification 

After the DWCE segmentation identified a set of potential mass objects in the
mammograms, ROIs corresponding to the detected object locations were
extracted from the original 100-μm images and used as input to a texture
classifier. The extracted ROIs had a fixed size of 256×256 pixels and the
center of each ROI corresponded to the centroid location of a detected
object. When the object was located close to the border of the mammogram and
a complete 256×256 pixel ROI could not be defined, the ROI was shifted over
until the appropriate edge coincided with the border of the original image.
The classification of these fixed sized ROIs was based on a multiresolution
texture analysis scheme. The approach has been described in detail in the
literature with the essential steps in the classification summarized below.
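
The placement of the fixed-size ROI, including the shift applied near the image border, can be sketched as a simple clamp. The function name `roi_top_left` and the (x, y) and (width, height) argument conventions are assumptions for this illustration.

```python
def roi_top_left(centroid, image_size, roi_size=256):
    """Return the top-left corner (x0, y0) of a roi_size x roi_size ROI.

    The ROI is centered on the object centroid when possible; near a border
    it is shifted until its edge coincides with the image border.
    centroid   -- (x, y) in full-resolution pixel coordinates
    image_size -- (width, height) of the full-resolution image
    """
    cx, cy = centroid
    width, height = image_size
    if width < roi_size or height < roi_size:
        raise ValueError("image is smaller than the requested ROI")
    half = roi_size // 2
    x0 = min(max(cx - half, 0), width - roi_size)
    y0 = min(max(cy - half, 0), height - roi_size)
    return x0, y0


if __name__ == "__main__":
    print(roi_top_left((1000, 1200), (2000, 2500)))   # interior object
    print(roi_top_left((50, 2450), (2000, 2500)))     # object near a corner
```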

1.  Texture  features 

The texture features used in the classification were derived from the spatial
gray-level dependence (SGLD) matrix. An element of the SGLD matrix,
$p_{d,\theta}(i,j)$, is the joint probability that the gray levels $i$ and
$j$ occur at a given interpixel separation $d$ and direction $\theta$. A set
of SGLD matrices can be defined by varying the separation and direction.
Thirteen texture features were derived from each SGLD matrix including
correlation, energy, entropy, inertia, inverse difference moment, sum
average, sum variance, sum entropy, difference average, difference variance,
difference entropy, and two measures of correlation information. The
mathematical definitions for the SGLD features can be found in the
literature. These features were selected because they were found to be
effective in the classification of ROIs containing masses or normal tissue
manually identified by radiologists. Each texture feature was calculated in
the $\theta$ = 0°, 45°, 90°, 135° directions. The features obtained at
$\theta$ = 0°, 90° and $\theta$ = 45°, 135° were averaged since no angular
bias was seen in the texture of masses, and we did not find any significant
difference in classification accuracy between features at separate angles and
their averaged values. The features calculated at adjacent pixels on axis
($\theta$ = 0°, 90°) and those in the diagonal direction ($\theta$ = 45°,
135°) were not averaged because of the significant $\sqrt{2}$ difference in
the actual distances.
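
A small sketch may help make the SGLD matrix concrete. It builds the joint probability matrix for one separation and direction and evaluates three of the thirteen features (energy, entropy, inertia) using their common textbook definitions. The symmetric pair counting, the base-2 logarithm, and the function names are assumptions; the study's exact definitions are those cited in the literature.

```python
import math


def sgld_matrix(image, d, direction, levels):
    """Joint probability of gray levels (i, j) at separation d.

    image     -- list of rows of integer gray levels in [0, levels)
    direction -- unit offset (dx, dy), e.g. (1, 0) for 0 deg, (0, 1) for 90 deg
    Pairs are counted symmetrically (an assumption of this sketch).
    """
    dx, dy = direction
    rows, cols = len(image), len(image[0])
    counts = [[0] * levels for _ in range(levels)]
    total = 0
    for y in range(rows):
        for x in range(cols):
            x2, y2 = x + d * dx, y + d * dy
            if 0 <= x2 < cols and 0 <= y2 < rows:
                i, j = image[y][x], image[y2][x2]
                counts[i][j] += 1
                counts[j][i] += 1
                total += 2
    if total == 0:
        raise ValueError("separation d is too large for this image")
    return [[c / total for c in row] for row in counts]


def texture_features(p):
    """Return (energy, entropy, inertia) of an SGLD matrix p."""
    energy = sum(v * v for row in p for v in row)
    entropy = -sum(v * math.log2(v) for row in p for v in row if v > 0)
    inertia = sum((i - j) ** 2 * v
                  for i, row in enumerate(p) for j, v in enumerate(row))
    return energy, entropy, inertia


if __name__ == "__main__":
    stripes = [[0, 0], [1, 1]]
    print(texture_features(sgld_matrix(stripes, 1, (1, 0), 2)))
    print(texture_features(sgld_matrix(stripes, 1, (0, 1), 2)))
```

On the striped example the inertia is zero along the stripes and nonzero across them, which shows why the matrix depends on direction as well as distance.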

Before  the  texture  features  were  calculated,  background 
correction  was  performed  on  the  individual  ROIs  using  a 
method  described  previously. An  ROI  was  first  low-pass 
filtered  and  a  pixel  in  the  low-frequency  background  image 
was  estimated  as  a  weighted  sum  of  the  pixel  values  sur¬ 
rounding  the  ROI.  The  difference  between  the  original  ROI 
and  the  background  thus  reduced  the  gray-level  variation  due 
to  the  low-frequency  structured  background  within  the  ROI. 

2. Global multiresolution SGLD features

A wavelet transform with a four coefficient Daubechies kernel was used to
decompose the individual ROIs into multiple scales after background
correction. Multiresolution ROI images were obtained using the original ROI
(Scale 1) and the first two low-pass down-sampled approximation wavelets
(Scales 2 and 4, respectively). The wavelet coefficients at Scale 8 were
obtained by wavelet filtering but without down-sampling so that the minimum
image size was maintained at 64×64 pixels. This minimum size was selected in
order to reduce the statistical uncertainty when SGLD matrices of large pixel
distances were calculated from the Scale 8 wavelet images.

Fourteen SGLD matrices, with effective distances of
d = {1, 2, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48} pixels relative to
the original ROI, were calculated in both the on-axis ($\theta$ = 0°, 90°)
and diagonal ($\theta$ = 45°, 135°) directions for each ROI using the Scale
1, 2, 4, and 8 wavelet images. Figure 5 contains a graphical representation
of how the different wavelet images were related to the different SGLD
matrices and the different object features. The SGLD matrices with
d = {1, 2, 4} were calculated using a pixel distance of one in the Scale 1,
2, and 4 wavelet images, respectively. The eleven SGLD matrices at
d = {8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48} were calculated from the
Scale 8 wavelet image with pixel distances from 2 to 12 pixels. This process
produced a total of 28 different SGLD matrices and 364 global multidistance
texture features for each ROI.

3. Local multidistance SGLD features

A set of local texture features was also calculated for each ROI. Five
rectangular subregions were segmented from each ROI: an object subregion
defined by the original DWCE object bounding box located at the center of the
ROI, and four peripheral subregions at the corners. For a given pixel
distance $d$ and a given direction $\theta$, an SGLD matrix was formed from
the object subregion and another SGLD matrix was formed from the pixel pairs
in the four peripheral subregions. These local SGLD matrices were calculated
for d = {1, 2, 4, 8} and $\theta$ = {0°, 90°} and {45°, 135°}. The thirteen
texture features were calculated for both the object and periphery SGLD
matrices. A total of 208 local features were defined for each ROI. They
included the 104 features in the object region and 104 additional features
defined as the difference between the feature values in the object and the
periphery regions.

Fig. 5. Graphical representation of the parameters used in extracting the
364 global features from the multiresolution wavelet images. The effective
pixel distances d = {1, 2, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48} for
the SGLD matrices are relative to the original image.
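
The feature counts quoted for the global and local texture sets can be checked with a few lines of arithmetic. The sampling step of 4 original pixels for the Scale 8 image is an inference from the statement that Scale 8 was computed without down-sampling, so it keeps the pixel spacing of Scale 4; the names below are illustrative only.

```python
N_FEATURES = 13          # texture features per SGLD matrix
DIRECTION_GROUPS = 2     # on-axis (0, 90 deg) and diagonal (45, 135 deg)

# (scale, spacing of one pixel in original-ROI pixels, pixel distances used)
# The spacing of 4 for Scale 8 is inferred: no down-sampling after Scale 4.
GLOBAL_PLAN = [
    (1, 1, [1]),
    (2, 2, [1]),
    (4, 4, [1]),
    (8, 4, list(range(2, 13))),
]
LOCAL_DISTANCES = [1, 2, 4, 8]


def global_distances():
    """Effective distances, in original-ROI pixels, of the global matrices."""
    return [step * d for _, step, dists in GLOBAL_PLAN for d in dists]


def feature_counts():
    """Return (global, local, total) numbers of texture features per ROI."""
    n_global = len(global_distances()) * DIRECTION_GROUPS * N_FEATURES
    n_object = len(LOCAL_DISTANCES) * DIRECTION_GROUPS * N_FEATURES
    n_local = 2 * n_object   # object features plus object-periphery differences
    return n_global, n_local, n_global + n_local


if __name__ == "__main__":
    print(global_distances())
    print(feature_counts())
```

The totals reproduce the 364 global features, the 208 local features, and the pool of 572 features used for feature selection.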

4. Linear discriminant analysis

Linear discriminant analysis (LDA) uses a set of feature variables to
classify an individual into one of a set of mutually exclusive classes. We
found in our previous studies that the LDA using SGLD texture features can
effectively separate masses from normal tissue using ROIs manually selected
by radiologists. In our two class (mass and normal tissue) problem, the set
of 572 global and local texture features was used as a pool of predictor
variables in a stepwise selection procedure. This procedure selected a subset
of features from the feature space based on the maximization of the
Mahalanobis distance. The stepwise selection eliminates irrelevant variables
and thus improves the generalization capability of a discriminant function
optimized with a finite number of training cases.
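
To illustrate what a two-class linear discriminant computes, the sketch below derives Fisher's discriminant weights for two features with the 2×2 matrix inverse written out by hand. It is a simplified stand-in, not the procedure of this study: stepwise feature selection over the 572-feature pool is not reproduced, and the function name `fisher_lda_2d` is hypothetical.

```python
def fisher_lda_2d(class0, class1):
    """Return weights (w0, w1) so that score(x) = w0*x[0] + w1*x[1].

    class0, class1 -- lists of two-feature samples (e.g. normal and mass).
    w = S^-1 (m1 - m0), where S is the pooled within-class covariance.
    """
    def mean(points):
        n = len(points)
        return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

    def scatter(points, m):
        sxx = sum((p[0] - m[0]) ** 2 for p in points)
        sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
        syy = sum((p[1] - m[1]) ** 2 for p in points)
        return sxx, sxy, syy

    m0, m1 = mean(class0), mean(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    dof = len(class0) + len(class1) - 2
    sxx, sxy, syy = ((a + b) / dof for a, b in zip(s0, s1))
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("pooled covariance matrix is singular")
    dx, dy = m1[0] - m0[0], m1[1] - m0[1]
    # explicit inverse of [[sxx, sxy], [sxy, syy]] applied to (dx, dy)
    return ((syy * dx - sxy * dy) / det, (sxx * dy - sxy * dx) / det)


if __name__ == "__main__":
    normal = [(0, 0), (1, 0), (0, 1), (1, 1)]
    mass = [(3, 3), (4, 3), (3, 4), (4, 4)]
    print(fisher_lda_2d(normal, mass))
```

Thresholding the resulting score gives the decision; moving that threshold trades TP fraction against FP detections.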

With the DWCE segmentation and object splitting algorithm, many of the
extracted ROIs overlapped with one another because of the adjacency of the
objects. We selected the independent ROIs (i.e., the ROIs that did not
overlap with one another) to form a training set in order to avoid biases in
the statistical distributions of the feature vectors. Two independent sets,
$G1_i$ and $G2_i$, were formed by reducing all pairs of overlapping ROIs to
single regions in G1 and G2, respectively. If a true mass ROI overlapped with
a normal tissue ROI, the true mass region was saved while the normal region
was eliminated. If two normal regions overlapped, one randomly selected
region was eliminated. Finally, if two regions containing the full breast
mass overlapped, the region defined by the DWCE segmented object which
contained the centroid of the true mass was saved while the other was
eliminated. These independent $G1_i$ and $G2_i$ sets were individually used
to train the LDA classifiers while the full G1 and G2 sets were used for
classifier evaluation.

Table I. The number of detected objects, the single stage reduction, the mean
object area ($\mu_{Area}$) and the standard deviation of the object areas
($\sigma_{Area}$) for the G1 data set after the global, local, and splitting
stage FP reduction steps. The single stage reduction is defined as the
reduction achieved by the morphological FP reduction block in each stage.

Stage    TP detections   FP detections   Single stage   $\mu_{Area}$   $\sigma_{Area}$
                         per image       reduction      (pixels)       (pixels)
Global   82 of 84        34.6            25%            63.3           109.0
Local    81 of 84        12.4            75%            286.4          351.6
Split    81 of 84        18.9            14%            122.0          122.1

Table III. The number of FPs per image of each FROC curve at 90% and 80% TP
detection fractions.

Training set   Test set   FPs per image       FPs per image
                          (90% TP fraction)   (80% TP fraction)
$G1_i$         G1         3.77                1.88
$G2_i$         G2         4.55                1.47
$G2_i$         G1         3.98                2.50
$G1_i$         G2         4.72                2.08

To improve the statistical properties of the feature distributions, we used
the entire set of segmented ROIs from both the $G1_i$ and $G2_i$ image sets
for selection of feature variables. After feature selection, the G1 and G2
groups were used alternately as training and test sets. For example, when the
coefficients of the linear discriminant function were optimized by the
feature values from the $G1_i$ set, the classification accuracy of the linear
discriminant function was tested with the full G2 set. The $G1_i$-trained
linear discriminant function was also applied to the full G1 group to
evaluate its self-consistency. Therefore, a total of four groups of
discriminant scores were obtained: {Train: $G1_i$, Test: G1},
{Train: $G2_i$, Test: G2}, {Train: $G2_i$, Test: G1}, and
{Train: $G1_i$, Test: G2}.

In this study, FROC analysis was used to evaluate the performance of the
complete segmentation method. The tradeoff between the TP fraction and the
number of FP detections per image was determined by varying the decision
threshold on the ROI discriminant scores. The raw detection data for both the
full group training and test cases are presented, along with the fitted FROC
curves obtained using the FROCFIT program.
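
The raw FROC points described above can be generated by sweeping the decision threshold over the discriminant scores, as in the following sketch. The data layout (a score paired with a mass identifier, or `None` for a false positive), the convention that higher scores are more mass-like, and the function name `froc_points` are assumptions; curve fitting with FROCFIT is not reproduced.

```python
def froc_points(detections, n_masses, n_images):
    """Return raw FROC points as (FPs per image, TP fraction) pairs.

    detections -- list of (score, mass_id) tuples; mass_id is None for a
                  detection that does not correspond to a true mass
    Higher scores are assumed to be more mass-like.
    """
    points = []
    thresholds = sorted({score for score, _ in detections}, reverse=True)
    for t in thresholds:
        kept = [(s, m) for s, m in detections if s >= t]
        found = {m for _, m in kept if m is not None}   # each mass counted once
        n_fp = sum(1 for _, m in kept if m is None)
        points.append((n_fp / n_images, len(found) / n_masses))
    return points


if __name__ == "__main__":
    rois = [(0.9, "m1"), (0.8, None), (0.7, "m2"), (0.4, None), (0.3, None)]
    for fp_rate, tp_fraction in froc_points(rois, n_masses=2, n_images=2):
        print(fp_rate, tp_fraction)
```

Counting each mass once keeps overlapping ROIs of the same mass from inflating the TP fraction.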

III.  RESULTS 

The numbers of TP and FP objects detected in the global
and local stages of the DWCE segmentation are summarized
in Tables I and II for the G1 and G2 image sets, respectively.
A TP detection for the DWCE segmentation is again simply
defined as an object locating the centroid of a breast mass,
and a FP is any object other than the true mass (as discussed
in Sec. II A). The two-stage DWCE segmentation missed
only 8 of the 168 breast masses contained in the entire image
set. Using the sets of TP and FP objects, 256×256 pixel
ROIs representing each of the detected objects were extracted
from the full resolution mammograms. A total of
1690 ROIs were extracted from the set of 84 G1 images and
1874 from the G2 mammograms. The independent G1_i and
G2_i sets used for LDA training included 476 and 503 nonoverlapping
ROIs, respectively. Stepwise feature selection
was then performed on the 572 multidistance texture features
using the combined G1_i and G2_i image sets, as described
above, and 29 features were selected. These 29 features were
used in the LDA texture classification for training and testing
both the G1 and G2 image sets. Figures 6 and 7 show the
raw and fitted training FROC curves obtained using the LDA
texture classifier for the {Train:G1_i, Test:G1} and
{Train:G2_i, Test:G2} combinations. The raw and fitted
FROC curves for the test sets, {Train:G2_i, Test:G1} and
{Train:G1_i, Test:G2}, are likewise depicted in Figs. 8 and 9.
Finally, Table III contains the raw FROC results at TP detection
rates of 90% and 80%, and Table IV contains the
FROCFIT program parameters estimated for each of the fitted
FROC curves.

Table II. The number of detected objects, the single stage reduction, the
mean object area (μ_Area), and the standard deviation of the object areas
(σ_Area) for the G2 data set after the global, local, and splitting stage FP
reduction steps. The single stage reduction is defined as the reduction
achieved by the morphological FP reduction block in each stage.

Stage    TP detections   FP detections   Single stage   μ_Area     σ_Area
                         per image       reduction      (pixels)   (pixels)
Global   79 of 84        32.9            32%            64.4       112.4
Local    79 of 84        21.4            62%            219.8      289.6
Split    79 of 84        21.6            7%             108.2      91.2
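
The ROI extraction step can be sketched as follows. This is an illustration only: the text does not state how ROIs near the image border were handled, so the sketch assumes the 256×256 window is simply shifted inward, and the name extract_roi is hypothetical.

```python
import numpy as np

def extract_roi(image, centroid_rc, size=256):
    """Cut a size x size ROI centred on an object centroid given as (row, col).

    Assumption: where the window would cross the image border it is shifted
    inward, so the ROI always has the full size. The image must be at least
    size pixels in each dimension.
    """
    rows, cols = image.shape
    half = size // 2
    r0 = min(max(centroid_rc[0] - half, 0), rows - size)
    c0 = min(max(centroid_rc[1] - half, 0), cols - size)
    return image[r0:r0 + size, c0:c0 + size]
```

For a centroid well inside the image the ROI is centred on it. For a centroid near a corner the ROI is the nearest full window.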

IV.  DISCUSSION 
A.  DWCE  segmentation 

The purpose of the global processing stage was to define a
set of local regions which contained the true breast masses
and as few normal regions as possible. The initial DWCE
filtering and subsequent edge detection were able to detect
161 of the 168 true masses in this preliminary study, including
83 of the 85 malignant masses. In addition, five of the
seven missed masses, including both malignant masses, were

Table IV. Summary of the FROCFIT parameters and goodness of fit values.
The headings for the table are: the two estimated FROCFIT parameters (a and
b), the standard deviation of the estimated parameters (σ_a and σ_b), the area
under the alternative FROC curve (A_AFROC), the standard deviation of the
area (σ_A), the normalized chi-squared value (χ²), and the significance
probability for the fit (Prob).

Training   Test
set        set    a      σ_a    b      σ_b    A_AFROC   σ_A    χ²     Prob
G1_i       G1     0.19   0.11   0.50   0.05   0.57      0.04   1.39   0.04
G2_i       G2     0.28   0.11   0.47   0.05   0.61      0.04   0.92   0.61
G2_i       G1     0.11   0.11   0.58   0.05   0.54      0.04   0.96   0.55
G1_i       G2     0.11   0.11   0.45   0.04   0.54      0.04   1.03   0.42
Fig. 6. FROC curves obtained with the image group {Train:G1_i, Test:G1}.
The data points are raw data obtained by varying the decision threshold on
the discriminant scores. The solid curve is obtained from the FROCFIT
program.

detected  in  another  mammogram  containing  a  different  view 
of  the  same  breast.  The  image  set  did  not  include  any  addi¬ 
tional  views  for  the  two  remaining  misses.  This  indicates  that 
the  global  stage  is  effective  in  the  initial  detection  task.  How¬ 
ever,  the  morphological  properties  of  the  detected  regions 
proved  to  be  of  limited  value  in  differentiating  between  TP 
and  FP  objects  in  the  low-resolution  DWCE  filtered  images. 
The  main  problem  was  that  the  global  detection  underesti¬ 
mated  the  size  of  the  actual  structures.  This  can  be  clearly 
seen  in  Fig.  3(g)  where  the  detected  objects  are  usually  much 
smaller  than  the  actual  structures  in  the  image.  The  average 
size,  after  FP  reduction,  of  the  global  stage  objects  was  64.4 
pixels.  This  underestimation  can  be  mainly  attributed  to  the 
large  intensity  range  over  which  the  background  suppression 


Fig. 7. FROC curves obtained with the image group {Train:G2_i, Test:G2}.
The data points are raw data obtained by varying the decision threshold on
the discriminant scores. The solid curve is obtained from the FROCFIT
program.

Fig. 8. FROC curves obtained with the image group {Train:G2_i, Test:G1}.
The data points are raw data obtained by varying the decision threshold on
the discriminant scores. The solid curve is obtained from the FROCFIT
program.

was  defined.  This  leads  to  inaccuracies  in  the  object  borders 
affecting  the  morphological  features  and  reducing  the  effec¬ 
tiveness  of  the  FP  reduction.  The  morphological  features  and 
sequential  classification  were  still  able  to  achieve  a  29%  re¬ 
duction  in  the  initial  number  of  regions,  but  at  the  end  of  the 
global  stage  an  average  of  34  detected  regions  per  image 
across  the  Gl  and  G2  sets  still  remained.  In  further  analysis 
of  the  detected  regions,  it  was  observed  that  fatty  breasts  had 
relatively  few  detected  structures  while  mammograms  con¬ 
taining  dense  tissue  had  a  much  larger  number  of  regions. 

The  limitations  of  the  global  stage  were  partially  over¬ 
come  by  repeating  the  filtering  and  edge  detection  in  the 
local  regions  identified  in  the  global  stage.  By  allowing  the 
DWCE  filter  to  adapt  to  the  background  within  these  much 


Fig. 9. FROC curves obtained with the image group {Train:G1_i, Test:G2}.
The data points are raw data obtained by varying the decision threshold on
the discriminant scores. The solid curve is obtained from the FROCFIT
program.
smaller regions, better estimates for the true borders of the
mammographic  structures  were  achieved  without  sacrificing 
true  mass  detections.  The  local  DWCE  stage  was  able  to 
detect  160  of  the  168  true  masses  in  this  preliminary  study 
where,  again,  83  of  the  85  malignant  masses  were  detected. 
The  additional  missed  mass  did  not  come  from  the  DWCE 
filtering  and  edge  detection  but  was  instead  lost  in  the  local 
FP  reduction.  This  mass  was  detected  in  a  different  mammo¬ 
gram  from  our  image  set  which  contained  a  different  view  of 
the  same  breast.  It  is  evident  by  comparing  Figs.  3(g)  and 
3(h)  that  the  detected  objects  in  the  local  stage  match  the  true 
borders  better  than  the  global  stage  objects.  The  average  area 
of  the  detected  objects  following  the  local  FP  reduction  in¬ 
creased  to  253  pixels  from  the  64.4  pixels  following  the  glo¬ 
bal  stage.  The  more  accurate  borders  help  improve  the  local 
FP  reduction  which  provided  a  69%  average  reduction  in  the 
initial  number  of  local  FP  regions  and  a  corresponding  50% 
reduction  in  the  number  of  FPs  from  the  output  of  the  global 
stage.  The  number  of  detected  regions  following  the  local  FP 
reduction  was  still  quite  large,  with  an  average  of  16.9  re¬ 
gions  detected  per  image.  This  large  number  of  regions  can 
be  attributed  to  two  factors.  First,  while  improving  the  object 
border  estimates,  the  local  processing  still  continued  to  un¬ 
derestimate  their  true  size  [see  Fig.  3(h)].  This  limited  the 
effectiveness  of  the  morphological  FP  reduction  in  distin¬ 
guishing  between  the  masses  and  many  of  the  normal  struc¬ 
tures.  In  addition,  the  expanded  object  area  was  attributed 
not  only  to  the  more  precise  edge  characterization  but  also  to 
the  merging  of  neighboring  regions  into  single  detected  ob¬ 
jects.  The  merged  objects  caused  problems  in  the  final  tex¬ 
ture  analysis  stage  because  the  texture  information  for  the 
mass  regions  was  often  averaged  with  large  amounts  of  nor¬ 
mal  tissue,  thus  increasing  the  likelihood  that  the  true  breast 
masses  would  be  missed.  Object  splitting  partially  solved  the 
problem  of  merged  regions  by  estimating  merge  points  ac¬ 
cording  to  geometrical  shape.  However,  some  distortion  of 
the  morphological  features  remained.  Splitting  also  inadvert¬ 
ently  introduced  additional  FPs.  In  this  study  the  number  of 
FPs  increased  from  16.9  FPs/image  after  local  reduction  to 
20.3  FPs/image  after  the  splitting  reduction  step.  While  the 
results  of  this  preliminary  study  indicate  that  the  DWCE  seg¬ 
mentation  is  effective  in  detecting  breast  masses,  further  im¬ 
provements  in  the  scheme  will  be  necessary  to  reduce  the 
total  number  of  detected  regions. 

Closer  evaluation  of  the  images  where  a  mass  was  not 
detected  highlighted  a  problem  in  the  initial  rescaling  of 
some  images.  As  stated  previously,  the  initial  rescaling  step 
in  the  DWCE  [refer  to  Fig.  1(a)]  is  very  important  because  it 
allowed  a  single  set  of  filters  to  be  applied  uniformly  to  all 
the  mammograms.  The  rescaling  should  have  occurred  only 
within  the  breast  region  of  the  image.  We  have  found  that  the 
initial  breast  map  included  a  strip  of  pixels  belonging  to  a 
bright  edge  outside  the  breast  region  of  the  mammogram  in 
two of the images with missed masses. The pixels in this
strip  had  a  higher  intensity  than  any  of  the  other  pixels  in  the 
breast  region  of  the  mammogram,  and  their  inclusion  in  the 
rescaling  caused  many  lower-intensity  objects  to  be  missed. 


When  this  strip  was  removed  from  the  two  images,  the 
masses  were  detected. 
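
The effect of the bright strip can be illustrated with a small numerical sketch. The exact rescaling used in the DWCE is defined in Fig. 1(a) and is not reproduced here. The sketch assumes a simple linear min-max rescaling to [0, 1] over the pixels of the breast map, and the name rescale_in_mask is hypothetical.

```python
import numpy as np

def rescale_in_mask(image, breast_mask):
    """Linearly rescale pixels inside the breast map to [0, 1] (assumed form).

    Pixels outside the map are set to 0. The masked region must contain
    at least two distinct pixel values.
    """
    inside = image[breast_mask]
    lo, hi = inside.min(), inside.max()
    out = np.zeros(image.shape, dtype=float)
    out[breast_mask] = (inside - lo) / (hi - lo)
    return out

# Toy row of pixels; the last pixel stands for the bright edge strip.
image = np.array([[100.0, 200.0, 300.0, 4000.0]])
correct_map = np.array([[True, True, True, False]])
faulty_map = np.array([[True, True, True, True]])

print(rescale_in_mask(image, correct_map)[0, 1])  # mid-range value
print(rescale_in_mask(image, faulty_map)[0, 1])   # compressed toward 0
```

With the faulty map, the bright strip sets the upper end of the rescaling range, so the true breast pixels are compressed into a narrow band of low values. This is consistent with the lower-intensity objects being missed.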

B.  Morphological  classification 

Another  important  factor  that  affects  the  FP  reduction  is  the 
choice  of  morphological  features.  The  eleven  morphological 
features  used  in  this  study  were  selected  because  individually 
they  showed  some  potential  in  differentiating  between 
shapes.  However,  they  are  probably  not  the  optimal  set  of 
morphological  features  for  this  task.  The  best  subgroup  of 
the  features  was  found  to  be  the  area,  perimeter-to-area  ratio, 
and  the  contrast  which  provided  the  best  BPN  classifier  per¬ 
formance.  No  general  conclusions  from  this  preliminary 
study  can  be  made  about  the  applicability  of  the  individual 
features  because  of  the  small  size  of  the  image  set  and  the 
suboptimal  border  information  provided  by  the  DWCE 
detection. 

The  morphological  classification  is  an  important  compo¬ 
nent  in  the  overall  FP  reduction.  In  this  study,  we  selected 
the  sequential  application  of  a  thresholding,  an  LDA,  and  a 
BPN  classifier.  The  order  of  application  was  found  to  be 
important.  The  investigation  showed  that  the  LDA  and  espe¬ 
cially  the  BPN  classifier  were  trained  faster  and  performed 
better  when  the  initial  number  of  FPs  in  the  training  set  was 
small,  thus  leading  to  the  use  of  the  sequential  classification 
scheme.  We  have  not  presented  the  exact  values  of  the  fixed 
thresholds  used  in  this  study  because  of  the  small  size  of  this 
preliminary  image  set.  With  a  larger,  more  representative 
training  set,  the  particular  threshold  values  will  need  to  be 
adjusted.  Therefore,  we  have  instead  concentrated  on  de¬ 
scribing  the  general  methodology  for  selecting  the  individual 
thresholds,  as  outlined  in  Sec.  II B. 
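
The sequential scheme can be sketched as a cascade in which each classifier sees only the objects that survived the previous one. The sketch below is illustrative: the stage functions, feature columns, and numerical values are hypothetical and are not the thresholds or classifiers used in this study. A trained BPN would be supplied as a third stage in the same way.

```python
import numpy as np

def sequential_fp_reduction(features, stages):
    """Apply classifier stages in order and return indices of surviving objects.

    features : (n_objects, n_features) array of morphological features
    stages   : list of callables mapping a feature array to a boolean keep-mask
    """
    features = np.asarray(features, dtype=float)
    alive = np.arange(len(features))
    for stage in stages:
        if alive.size == 0:
            break
        keep = np.asarray(stage(features[alive]), dtype=bool)
        alive = alive[keep]
    return alive

# Hypothetical stages: column 0 is area, column 1 is perimeter-to-area ratio.
def area_threshold(f):
    return (f[:, 0] > 20) & (f[:, 0] < 5000)

def linear_discriminant(f):
    # Made-up weights and offset, standing in for a trained LDA.
    return f @ np.array([0.001, -1.0]) + 0.5 > 0

objects = [[10, 0.2], [100, 0.3], [200, 0.9]]
survivors = sequential_fp_reduction(objects, [area_threshold, linear_discriminant])
```

Because later stages receive fewer objects, they can be trained on smaller sets, which is the motivation given above for applying the classifiers in sequence.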

C.  Texture  classification 

The large number of regions detected in the DWCE segmentation
precipitated the need for additional FP reduction. This
additional reduction was achieved by classifying with multiresolution
texture features extracted from the DWCE detected
regions. The LDA classification using SGLD features
was selected because it was found to be effective in differentiating
breast masses from normal tissue in regions identified
by radiologists. Again, we have not presented details
about the particular features selected because of the small size
of the data set. However, a detailed discussion of the multiresolution
texture features and the LDA texture classification
method can be found in the literature. The texture classification
in this final step resulted in an average of 4.4 FPs/image
at a 90% TP rate and 2.3 FPs/image at an 80% TP rate
(Table III) in the test sets. These results indicate that the
overall system (i.e., DWCE segmentation plus LDA texture
classification) is capable of automatically detecting breast
masses on digitized mammograms. Table III also indicates
that the G1 and G2 image sets were reasonably well
matched. The G2 set provided slightly better performance at
a 90% TP fraction but the G1 set's performance was better
for the 80% detection level.
Figures 6-9 contain fitted FROC curves obtained using
the FROCFIT program developed by Chakraborty et al.
Table IV contains the estimated fit parameters and the goodness
of fit characteristics obtained with the program. The
fitted curves match well visually with the raw FROC results
and the normalized χ² goodness-of-fit values only varied from 0.92
to 1.39 (optimal value is 1.0). This indicates that the FROCFIT
program may be able to fit our raw FROC results. However,
it is likely the signals detected by our method do not satisfy
the assumptions that the occurrence of an FP follows Poisson
statistics and that the FPs are independent. Further studies
are therefore needed to investigate if the good fit observed
occurs by chance and if the area under the alternative FROC
curve (A_AFROC) can be used as an indication of the overall
performance of the classification system.
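
One simple way to probe the Poisson assumption is the variance-to-mean ratio of the FP counts per image, which is close to 1 for Poisson-distributed counts. The sketch below is offered only as an illustration. It is not part of FROCFIT or of this study, and the function name and the example counts are hypothetical.

```python
import numpy as np

def dispersion_index(fp_counts_per_image):
    """Variance-to-mean ratio of FP counts; near 1 if the counts are Poisson."""
    counts = np.asarray(fp_counts_per_image, dtype=float)
    return counts.var(ddof=1) / counts.mean()

# Made-up counts: images alternating between very few and very many FPs.
mixed = dispersion_index([0, 10, 0, 10])
uniform = dispersion_index([2, 2, 2, 2])
```

A ratio well above 1, as could arise if fatty breasts yield few FPs and dense breasts yield many, would suggest that a single Poisson rate does not describe the whole image set.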

D.  Future  studies 

Our  results  indicate  that  DWCE  segmentation  can  be  used 
to  effectively  detect  breast  structures  on  a  mammogram.  The 
flexible  form  of  the  DWCE  filter  leaves  open  the  possibility 
that  further  optimization  of  the  detection  parameters  may 
improve  overall  performance.  Evaluation  of  different  DWCE 
filters (e.g., modifying K_M and K_NL) will be pursued in future
studies.

One  of  the  difficulties  in  the  DWCE  segmentation  method 
is  the  merging  of  regions  and  the  subsequent  need  to  split 
objects.  The  splitting  operation  increases  the  number  of  false 
regions  and  also  adversely  affects  the  morphological  infor¬ 
mation  by  introducing  straight  edges  at  the  split  locations.  In 
future  studies,  we  will  investigate  alternative  methods  for 
separating  merged  structures.  Gray-level  information  will  be 
used  in  conjunction  with  binary  shape  information  to  guide 
the  splitting.  The  sequential  change  in  shape  obtained  by 
region  growing  at  different  local  threshold  levels  will  more 
precisely  define  multiple  regions  within  a  single  DWCE  seg¬ 
mented  object.  This  approach  should  improve  the  morpho¬ 
logical  features  of  the  split  objects  and  increase  the  classifi¬ 
cation  accuracy  of  masses  and  normal  tissue,  thereby 
reducing  the  FP  detections.  Furthermore,  a  fundamental  im¬ 
provement  in  the  adaptivity  of  the  DWCE  segmentation  will 
be  needed  to  reduce  the  number  of  objects  extracted  in  the 
initial  stage.  One  possible  improvement  may  be  accom¬ 
plished  by  first  classifying  the  breast  parenchyma  into  differ¬ 
ent  types  (e.g.,  fatty,  mixed,  or  dense).  The  DWCE  filter 
parameters  can  then  be  optimized  specifically  for  each  tissue 
type.  This  would  allow  better  background  suppression,  and 
more  precise  object  extraction  in  different  types  of  breast 
parenchyma.  It  can  be  expected  that  the  initial  number  of  FP 
objects  detected  in  dense  breasts  will  be  reduced  without 
impacting  the  detection  on  fatty  breasts. 

Our  detection  scheme  makes  use  of  information  on  a 
single  mammogram.  In  mammographic  interpretation,  it  has 
been  found  that  symmetry  information  on  the  left  and  right 
mammograms  of  the  same  view  often  improves  the  detection 
of  subtle  abnormal  tissue  density. The  information  can  also 
be  used  to  eliminate  FP  detections  when  they  appear  on  both 
mammograms in symmetrical locations. However, the


symmetry  information  should  be  used  with  caution  because 
many  patient  mammograms  are  not  highly  symmetrical  due 
to  variations  in  compression  and  imaging  techniques,  as  well 
as  the  natural  asymmetry  in  tissue  structures.  We  will  inves¬ 
tigate  the  effectiveness  of  the  symmetry  information  from 
paired  mammograms  in  FP  reductions  in  future  studies. 

V.  CONCLUSION 

We  have  developed  an  image  enhancement  technique 
which  can  adaptively  suppress  the  low-frequency  structured 
background  and  enhance  the  contrast  of  structures  on  an  im¬ 
age.  The  technique  was  applied  to  the  segmentation  step  in  a 
CAD  program  for  detection  of  breast  masses.  It  was  found  to 
be  effective  in  enhancing  masses  and  normal  tissue  structures 
on  mammograms.  To  further  distinguish  between  masses  and 
normal  tissue,  the  potential  mass  regions  were  classified  with 
an  LDA  using  multiresolution  texture  features  extracted  from 
wavelet  coefficients  at  several  scales.  Results  of  FROC 
analysis  indicate  that  the  current  algorithm  can  achieve  a  TP 
rate  of  90%  at  4.4  FPs/image  and  a  TP  rate  of  80%  at  2.3 
FPs/image.  The  consistency  in  the  performance  of  the  algo¬ 
rithm  was  verified  by  training  and  testing  two  independent 
data  sets.  This  study  demonstrates  the  feasibility  of  our  ap¬ 
proach  to  computer-assisted  detection  of  masses  in  mammo¬ 
graphic  interpretation.  Further  investigations  are  under  way 
to  improve  the  detection  accuracy  and  test  its  performance  in 
large  data  sets. 
We are developing a computer program for automated detection of clustered microcalcifications on
mammograms.  In  this  study,  we  investigated  the  effectiveness  of  a  signal  classifier  based  on  a 
convolution  neural  network  (CNN)  approach  for  improvement  of  the  accuracy  of  the  detection 
program. Fifty-two mammograms with clustered microcalcifications were selected from patient
files.  The  clusters  on  the  mammograms  were  ranked  by  experienced  mammographers  and  divided 
into  an  obvious  group,  an  average  group,  and  a  subtle  group.  The  average  and  subtle  groups  were 
combined  and  randomly  divided  into  two  sets,  each  of  which  was  used  as  training  or  test  set 
alternately.  The  obvious  group  served  as  an  additional  independent  test  set.  Regions  of  interest 
(ROIs) containing potential individual microcalcifications were first located on each mammogram
by  the  automated  detection  program.  The  ROIs  from  one  set  of  the  mammograms  were  used  to  train 
CNNs  of  different  configurations  with  a  back-propagation  method.  The  generalization  capability  of 
the  trained  CNNs  was  then  examined  by  their  accuracy  of  classifying  the  ROIs  from  the  other  set 
and  from  the  obvious  group.  The  classification  accuracy  of  the  CNNs  for  the  ROIs  was  evaluated 
by  receiver  operating  characteristic  (ROC)  analysis.  It  was  found  that  CNNs  of  many  different 
configurations  can  reach  approximately  the  same  performance  level,  with  the  area  under  the  ROC 
curve (A_z) of 0.9. We incorporated a trained CNN into the detection program and evaluated the
improvement  of  the  detection  accuracy  by  the  CNN  using  free  response  ROC  analysis.  Our  results 
indicated  that,  over  a  wide  range  of  true-positive  (TP)  cluster  detection  rate,  the  CNN  classifier 
could  reduce  the  number  of  false-positive  (FP)  clusters  per  image  by  more  than  70%.  For  the 
obvious  cases,  at  a  TP  rate  of  100%,  the  FP  rate  reduced  from  0.35  cluster  per  image  to  0.1  cluster 
per  image.  For  the  average  and  subtle  cases,  the  detection  accuracy  improved  from  a  TP  rate  of  87% 
at  an  FP  rate  of  four  clusters  per  image  to  a  TP  rate  of  90%  at  an  FP  rate  of  1.5  clusters  per  image. 


Key  words:  mammography,  microcalcification,  computer-aided  diagnosis,  artificial  neural 
network,  receiver  operating  characteristic  (ROC)  analysis 


I. INTRODUCTION

In the United States, breast cancer is the leading cause of
death in women between 40 and 55 yr of age. One out of
eight women will develop breast cancer in their lifetime.
Studies have indicated that early detection and treatment improve
the chances of survival for breast cancer patients. At
present, mammography is the only proven method that can
detect minimal breast cancers. However, 10%-30% of the
breast cancers that are visible on mammograms in retrospective
studies are not detected due to various technical or human
factors. Double reading can reduce the miss rate on
radiographic reading. It has also been shown that
computer-aided diagnosis (CAD), in which a computer alerts
radiologists to suspicious locations on the images during
mammographic reading, can improve the detection accuracy
significantly. CAD is thus a viable cost-effective alternative
to double reading by radiologists.

One of the important indicators of the presence of breast
cancers is clustered microcalcifications. Clustered microcalcifications
can be seen on mammograms in 30%-50% of
breast cancers. It is difficult to detect subtle microcalcifications
because of the noisy mammographic background. A
number of research groups have been developing CAD programs
for the detection of microcalcifications. Chan
et al. demonstrated that a difference-image technique
can effectively detect microcalcifications on digitized mammograms.
Fam et al. and Davies et al. detected microcalcifications
using conventional image processing techniques.
Qian et al. recently devised a tree-structure filter and wavelet
transform for enhancement of microcalcifications to facilitate
detection. Other groups extracted morphological features
such as contrast, size, shape, and edge gradient of
microcalcifications, and classified them with various feature
classifiers. Wu et al. scanned for suspected microcalcifications
with the difference-image technique, then further
classified true and false detections by an artificial neural network
based on features extracted from their power spectra.
Similarly, Zhang et al. used a shift-invariant neural network
to reduce false-positive microcalcifications. The results
reported in all these studies appear to be encouraging for the
selected datasets.

In this study, we trained a convolution neural network
(CNN) to recognize mammographic microcalcifications. The
CNN was first developed for the detection of pulmonary
nodules on chest radiographs. This neural network is different
from the commonly used back-propagation neural network
in that its input is a region of interest (ROI) from the
image instead of extracted image features. It is also different
from the shift-invariant neural network used by Zhang
et al. in that the input ROI to the CNN includes an individual
microcalcification instead of a cluster, and that the
output of the CNN is a decision score for determination of
the presence of a microcalcification instead of a processed
image ROI. Therefore, with our approach no further image
processing techniques such as thresholding and region growing
have to be applied to an output ROI to determine if a
microcalcification is present. We have incorporated the
trained CNN into our detection program and evaluated its
effectiveness by the improvement in the overall detection accuracy
of the CAD program.

II. MATERIALS AND METHODS

A.  Case  selection 

In this study, we used mammograms that contained clustered
microcalcifications as case samples. The mammograms
were selected from the patient files in the Department of
Radiology at the University of Michigan Hospitals by experienced
mammographers. The mammograms were acquired
with a dedicated mammographic system with a 0.3 mm focal
spot, molybdenum (Mo) anode and 0.03 mm Mo filter, and a
5:1 reciprocating grid. A Kodak Min R/MRE mammographic
screen/film system using extended cycle processing was employed
as the image receptor. The presence of the clustered
microcalcifications and the histology for each case had been
verified by biopsy. The case samples included a mixture of
benign and malignant cases. However, in this study, we concentrated
on the detection rather than the classification of the
malignant/benign nature of the microcalcifications.

Fifty-two mammograms were selected for this study. Each
mammogram was ranked by the radiologist regarding the
visibility of the cluster of microcalcifications on a rating
scale of 1-5 (1 = very obvious, 5 = very subtle). The scale
was established subjectively relative to the cases encountered
in clinical practice in our hospitals. After ranking, we divided
the 52 mammograms into three groups: the mammograms of
ratings 1 and 2 were referred to as the obvious group
(N = 14), the mammograms of rating 3 as the average group
(N = 16), and the mammograms of ratings 4 and 5 as the
subtle group (N = 22). Although this classification was very
subjective, it was an attempt to demonstrate the dependence
of the performance of the CAD program on the database. We
also attempted to describe quantitatively the physical characteristics
of the microcalcifications on the digitized image and
correlated them with the visual ratings. We extracted digitally,
as discussed below, the contrast, the size, and the
signal-to-noise ratio (SNR) of the individual microcalcifications.
The mean and standard deviation (SD) of these physical
characteristics of the microcalcifications in each group
were compared.

B.  Digitization  of  mammograms 

All mammograms were digitized with a laser film scanner
(LUMISYS DIS-1000), with both the sampling distance and
the nominal spot size, and thus the pixel size, chosen to be
0.1 mm × 0.1 mm. The digitizer has a gray level resolution
of 12 bits and an optical density (O.D.) range of 0-3.5. It
was calibrated so that the O.D. on film was linearly proportional
to output pixel values in the range of about 0.1 O.D. to
2.8 O.D. at 0.001 O.D./pixel value. The slope of the calibration
curve outside this range decreased gradually. Before input
to the detection program, the pixel values were linearly
converted, such that low optical density was represented by
high pixel values.
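
The pixel value conversion can be sketched as below. The 12-bit range and the calibration slope of 0.001 O.D. per pixel value are taken from the text. The exact form of the linear conversion is not given, so the sketch assumes a simple reflection about the maximum 12-bit value. The function names are hypothetical.

```python
import numpy as np

MAX_PIXEL = 2**12 - 1        # 12-bit digitizer output
OD_PER_PIXEL_VALUE = 0.001   # calibration slope stated in the text

def invert_pixels(pixels):
    """Assumed linear conversion: low optical density maps to high pixel values."""
    return MAX_PIXEL - np.asarray(pixels, dtype=np.int32)

def od_difference(pixel_a, pixel_b):
    """O.D. difference between two raw (unconverted) pixel values.

    Valid only within the linear calibration range of about 0.1 to 2.8 O.D.
    """
    return OD_PER_PIXEL_VALUE * (pixel_a - pixel_b)
```

Under this assumption a raw difference of 1000 pixel values corresponds to 1.0 O.D., and the conversion leaves the magnitude of pixel differences unchanged while reversing their sign.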

To  establish  a  “truth”  file  with  which  the  computer  detec¬ 
tion  results  could  be  compared,  we  determined  the  true  loca¬ 
tions  of  the  individual  microcalcifications  on  each  mammo¬ 
gram  manually.  The  digitized  image  was  displayed  on  a 
workstation  and  the  region  containing  the  cluster  of  micro¬ 
calcifications  was  enlarged  to  full  resolution.  Each  individual 
microcalcification  on  the  displayed  image  was  identified 
carefully  by  comparison  with  the  mammogram  on  film  with 
a  magnifier.  The  coordinates  of  the  microcalcifications  were 
then  determined  by  a  cursor  and  stored  in  the  “truth”  file. 
The  same  regional  clustering  procedure  as  that  used  in  the 
detection  program  described  below  was  applied  to  the 
“truth”  file  to  determine  the  coordinates  of  the  centroid  of 
the  clusters.  These  coordinates  were  used  for  scoring  the 
detection  of  the  clusters  by  the  automated  procedure. 

It  may  be  noted  that  the  “truth”  file  thus  determined  may 
not  be  the  absolute  truth  because  of  the  difficulties  and  un¬ 
certainties  in  detecting  subtle  microcalcifications  that  are 
near  the  human  visual  threshold.  However,  this  is  the  best 
available  and  practical  method.  Neither  histologic  analysis 
nor  specimen  radiographs  can  be  used  to  identify  individual 
microcalcifications  seen  on  mammograms  because  of  the 
very  different  geometry  and  image  quality  obtained  with 
these  techniques.  Magnification  mammograms  are  often  not 
available  since  magnification  is  not  performed  for  every  case 
or  for  all  views. 


C.  Extraction  of  signal  characteristics 

To describe quantitatively the physical characteristics of the
microcalcifications on the digitized image, we have developed a signal
extraction program to determine the size, contrast, and SNR of the
microcalcifications from an unprocessed image based on the coordinate
of each individual microcalcification in the “truth” file. In a
51×51-pixel ROI centered at each signal site, the structured background
is estimated by polynomial curve fitting in the x and y directions.
The fitted pixel values in the x and y directions at the same pixel
are averaged. The process may be performed more than once to reach a
well-fitted smooth surface. The central l×l pixels in the region, which
contain the signal, are excluded from the curve fitting and noise
estimation. The size l is chosen to be a constant that is larger than
the diameters of the microcalcifications of interest yet much smaller
than 51 pixels. After subtraction of the structured background, the
local root-mean-square (RMS) noise is calculated. A local threshold
gray level is determined as the product of the RMS noise and an input
SNR threshold. With a region growing technique, the signal region is
then extracted as the connected pixels above the threshold around the
manually identified signal location. The size of the microcalcification
is estimated as the number of pixels in the signal region. The contrast
is defined as the maximum pixel value in the signal region after
subtracting the background. The SNR of the microcalcification is the
ratio of the contrast to the local RMS noise. The thresholded image of
the microcalcifications superimposed on a background of constant pixel
values can also be displayed for visual comparison.
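The extraction steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' program: the function names, the quadratic polynomial order, the single fitting pass, and the 4-connected region growing are all assumptions, since the text does not specify them.

```python
import numpy as np
from collections import deque


def estimate_background(roi, exclude, degree=2):
    """Average of row-wise and column-wise polynomial fits (assumed quadratic).

    Pixels in the central exclude x exclude block are left out of the fits.
    """
    n = roi.shape[0]
    coords = np.arange(n)
    lo = (n - exclude) // 2
    central = (coords >= lo) & (coords < lo + exclude)
    fit_x = np.empty_like(roi, dtype=float)
    fit_y = np.empty_like(roi, dtype=float)
    for r in range(n):
        keep = ~central if central[r] else np.ones(n, dtype=bool)
        coef = np.polyfit(coords[keep], roi[r, keep], degree)
        fit_x[r, :] = np.polyval(coef, coords)
    for c in range(n):
        keep = ~central if central[c] else np.ones(n, dtype=bool)
        coef = np.polyfit(coords[keep], roi[keep, c], degree)
        fit_y[:, c] = np.polyval(coef, coords)
    return 0.5 * (fit_x + fit_y), central


def extract_signal(roi, exclude=9, snr_threshold=2.0, degree=2):
    """Return size, contrast, and SNR of the signal at the ROI center."""
    roi = np.asarray(roi, dtype=float)
    background, central = estimate_background(roi, exclude, degree)
    residual = roi - background
    # RMS noise is computed outside the excluded central block.
    outside = ~(central[:, None] & central[None, :])
    rms = np.sqrt(np.mean(residual[outside] ** 2))
    threshold = snr_threshold * rms
    n = roi.shape[0]
    region = set()
    queue = deque([(n // 2, n // 2)])  # seed at the identified signal site
    while queue:
        r, c = queue.popleft()
        if (r, c) in region or not (0 <= r < n and 0 <= c < n):
            continue
        if residual[r, c] <= threshold:
            continue
        region.add((r, c))
        queue.extend([(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)])
    if not region:
        return None  # signal not extractable at this threshold
    contrast = max(residual[p] for p in region)
    return {"size": len(region), "contrast": contrast, "snr": contrast / rms}
```

Returning `None` when the seed pixel is below the local threshold mirrors the case, reported in the Results, in which some microcalcifications cannot be extracted at a given SNR threshold.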

D.  Computerized  detection  of  microcalcifications 

We  have  developed  a  computer  program  that  can  automati¬ 
cally  detect  microcalcifications  on  mammograms.  The  pro¬ 
gram  has  been  described  in  the  literature. Briefly,  there 
are  three  major  steps  in  the  algorithm:  preprocessing,  seg¬ 
mentation,  and  classification.  In  the  preprocessing  step,  an 
edge  detector  detects  the  breast  boundary  and  divides  the 
image  into  two  regions,  one  internal  and  the  other  external  to 
the  breast.  Signal  detection  is  applied  only  to  the  region 
within the breast. A signal-enhancement filter (1×1 kernel) is
employed to enhance the microcalcifications, and a signal-suppression
filter (a box-rim filter with an 8×8 kernel of constant weights around
the rim and a 4×4 central area of zero weights) is employed to remove
or suppress the microcalcifications and smooth the noise. Subtracting
the two filtered images results
in  an  SNR-enhanced  image  in  which  the  low-frequency 
structured  background  is  removed  and  the  high-frequency 
noise  is  suppressed.  This  is  also  referred  to  as  a  difference- 
image  technique.^^’^^’^^’^^  When  both  the  signal-enhancement 
filter  and  the  signal-suppression  filter  are  linear,  as  used  in 
this  study,  the  difference-image  technique  is  equivalent  to 
bandpass  filtering.  In  the  segmentation  step,  the  program  de¬ 
termines  the  gray  level  histogram  of  the  preprocessed  image 
within  the  breast  region.  A  gray  level  thresholding  technique 
is  used  to  locate  potential  signal  sites  above  a  global  thresh¬ 
old.  The  threshold  is  changed  iteratively  until  the  number  of 
sites  obtained  falls  within  the  chosen  input  maximum  (4000) 
and  minimum  (3000)  numbers.  At  each  potential  site,  a  lo¬ 
cally  adaptive  gray  level  thresholding  technique  in  combina¬ 
tion  with  region  growing  is  performed  to  determine  the  num¬ 
ber  of  connected  pixels  above  a  local  threshold,  which  is 
calculated  as  the  product  of  the  local  RMS  noise  and  an  input 
SNR  threshold.  The  signal  characteristics  to  be  used  in  the 
classification  step,  such  as  the  size,  maximum  contrast,  SNR, 
and  its  location,  are  obtained  in  this  step.  This  locally  adap¬ 
tive  thresholding  technique  is  similar  to  the  signal  character¬ 
istic  extraction  technique  described  above,  except  that  the 
procedure  is  performed  on  the  SNR-enhanced  image  instead 
of  the  unprocessed  image  so  that  no  curve  fitting  for  back¬ 
ground  correction  is  necessary. 
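The difference-image preprocessing described above can be illustrated with a short NumPy sketch. The box-rim kernel follows the 8×8 outer and 4×4 inner sizes given in the text; the 3×3 averaging kernel used here for signal enhancement, the edge padding, and all function names are assumptions made only for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def box_rim_kernel(outer=8, inner=4):
    """Constant weights on the rim, zeros in the central inner x inner area."""
    kernel = np.ones((outer, outer))
    start = (outer - inner) // 2
    kernel[start:start + inner, start:start + inner] = 0.0
    return kernel / kernel.sum()  # normalize so a flat image is unchanged


def box_kernel(size=3):
    """Assumed signal-enhancement kernel: a simple normalized box filter."""
    return np.full((size, size), 1.0 / size ** 2)


def correlate_same(image, kernel):
    """Linear filtering with edge padding; output has the input's shape."""
    kh, kw = kernel.shape
    top, left = (kh - 1) // 2, (kw - 1) // 2
    padded = np.pad(image, ((top, kh - 1 - top), (left, kw - 1 - left)),
                    mode="edge")
    windows = sliding_window_view(padded, (kh, kw))
    return np.einsum("ijkl,kl->ij", windows, kernel)


def difference_image(image, enhance, suppress):
    """Enhanced image minus suppressed image (a bandpass-like result)."""
    image = np.asarray(image, dtype=float)
    return correlate_same(image, enhance) - correlate_same(image, suppress)
```

Because both kernels are normalized to unit sum, a uniform background cancels in the subtraction, which is the removal of low-frequency structure that the text describes.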

Fig. 1. Schematic diagram of the architecture of a convolution neural
network (CNN), showing the first and second hidden layers. The input
ROI size, the number of hidden layers, and the number of node groups in
each layer are varied in this study.

In the classification step, the previous computer program performs
three tests to distinguish signals from noise or artifacts. A lower
bound (two pixels) is imposed on the size to exclude signals below a
certain size that are likely to be noise, and an upper bound (80
pixels) is set to exclude signals greater than a certain size that are
likely to be large benign calcifications. A contrast upper bound is
also set to exclude potential signals that have a contrast higher than
an input number (10) of SDs above the average contrast of all potential
signals found with local thresholding. This criterion excludes the very
high-contrast signals that are likely to be artifacts and large benign
calcifications. A regional clustering procedure is then applied to the
remaining signals; a signal is kept if the number of signals found
within a neighborhood of a chosen input diameter (1 cm) around that
signal is greater than an input minimum number. The remaining signals
that are not found to be in the neighborhood of any potential clusters
will be considered isolated noise points or calcifications and
excluded. This clustering criterion is useful for reducing false
positives, because true microcalcifications of clinical interest always
appear in clusters on mammograms. The specific parameters used in each
step have been described previously.
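The size, contrast, and regional clustering tests can be sketched as follows. This is an illustrative reading of the text, not the original program: the dictionary layout, the function name, the inclusive size bounds, and the convention that a signal counts itself toward the cluster minimum (set here to an assumed value of 3) are all hypothetical.

```python
import math
from statistics import mean, pstdev


def classify_signals(signals, min_size=2, max_size=80, contrast_sds=10.0,
                     cluster_diameter_mm=10.0, min_cluster_count=3):
    """Apply size, contrast, and clustering tests to candidate signals.

    Each signal is a dict with keys "x", "y" (mm), "size" (pixels),
    and "contrast" (pixel value).
    """
    if not signals:
        return []
    contrasts = [s["contrast"] for s in signals]
    # Contrast upper bound: a number of SDs above the mean of all candidates.
    limit = mean(contrasts) + contrast_sds * pstdev(contrasts)
    kept = [s for s in signals
            if min_size <= s["size"] <= max_size and s["contrast"] <= limit]
    radius = cluster_diameter_mm / 2.0
    clustered = []
    for s in kept:
        # Count the signals (including s itself) inside the neighborhood.
        count = sum(
            1 for t in kept
            if math.dist((s["x"], s["y"]), (t["x"], t["y"])) <= radius
        )
        if count >= min_cluster_count:
            clustered.append(s)
    return clustered
```

In the modified program described in the next paragraph, the CNN screening would sit between the size and contrast filter and the clustering loop.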

In  this  study,  we  investigated  the  effectiveness  of  a  trained 
convolution  neural  network  (CNN)^"^  in  discriminating  false 
signals  from  true  microcalcifications.  The  chosen  CNN  clas¬ 
sifier  was  incorporated  in  the  detection  program.  The  poten¬ 
tial  signals  that  passed  the  size  and  contrast  tests  in  the  clas¬ 
sification  step  were  further  screened  by  the  CNN  before 
being  examined  by  the  regional  clustering  criterion.  The 
overall  detection  accuracy  of  microcalcifications  with  and 
without  the  CNN  classifier  could  then  be  compared. 

E.  Convolution  neural  network  classifier 

The  artificial  neural  network  (ANN)  used  in  this  application 
is  a  convolution-type  neural  network.^"^  The  CNN  can  be 
considered  a  simplified  version  of  the  neocognitron^^  de¬ 
signed  to  simulate  the  human  visual  system.  The  general 
architecture of the CNN used in this study is shown in Fig. 1. It
consists of an input layer, one to several hidden layers, and an output
layer. The input layer of the CNN contains N×N input nodes; each input
node is a sensor for an input pixel value in an N×N-pixel ROI
containing the normal or abnormal pattern to be recognized. In the
hidden layers, the nodes are organized in groups, and the groups
between adjacent layers are interconnected by weights that are
organized in kernels. Learning is constrained such that the kernel of
weights connecting the kth group in the (L−1)th layer to the nth group
in the Lth layer is invariant for all nodes in the same group. Forward
signal propagation is thus similar to a spatially invariant convolution
operation; the signals from the nodes in the lower layer are convolved
with the weight kernel, and the resultant value of the convolution is
collected into the corresponding node in the upper layer. This value is
further processed by the node through an activation function and
produces an output signal that will, in turn, be forward propagated to
the subsequent layer in a similar manner. The convolution kernel
incorporates the neighborhood information in the input image pattern
and transfers the information to the receiving layers, thus providing
the pattern recognition capability of the CNN.

Fig. 2. The two groups of ROIs with true microcalcifications and false
positives used for training of the CNNs in this study. Each of the ROIs
shown here contains 16×16 pixels (1.6 mm × 1.6 mm). (a) ROIs in group 1
with true microcalcifications. (b) ROIs in group 1 with false
positives. (c) ROIs in group 2 with true microcalcifications. (d) ROIs
in group 2 with false positives.

The activation function between two layers is a sigmoidal function, and
the signal at the Lth layer is obtained from the signal at the (L−1)th
layer using the following relationship:

$$S_L((i,j);n) = \frac{1}{1+\exp\left[-\sum_{k} S_{L-1}((i,j);k) * W_L((i,j);k\Rightarrow n)\right]}, \tag{1}$$

where $S_L((i,j);n)$ denotes the signal at node $(i,j)$ in the nth
group and Lth layer, $W_L((i,j);k\Rightarrow n)$ denotes the weight
kernel connecting the kth group in the (L−1)th layer to the nth group
in the Lth layer, $*$ denotes the convolution operation, and the
summation is over all groups k that are connected to group n. Note that
the weight kernel for a given k and a given n is shift invariant, such
that

$$W_L((i',j';i,j);k\Rightarrow n) = W_L((i'-i,\,j'-j);k\Rightarrow n), \tag{2}$$

where $(i',j')$ denotes the node in the kth group and the (L−1)th
layer. Because of the convolution operation, the useful matrix size of
a node group in the Lth layer, $N_L \times M_L$, is reduced to
$(N_{L-1}-K_L+1) \times (M_{L-1}-K_L+1)$, where
$N_{L-1} \times M_{L-1}$ is the matrix size of a node group in the
(L−1)th layer and $K_L \times K_L$ is the size of a weight kernel
between the Lth layer and the (L−1)th layer.

In the output layer, there are individual output nodes. Each output
node is fully connected to all nodes in each group of the preceding
hidden layer. The signal at the nth output node is given by Eq. (1), in
which the weight matrix size is the same as the group size in the
preceding layer and the output group size is 1×1.
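The forward propagation through one layer of node groups can be sketched with NumPy as below. It is an illustration under stated assumptions rather than the authors' code: the array layout, the function names, and the optional `connections` mask are hypothetical, and the kernel is applied with the (i+u, j+v) indexing commonly used in implementations instead of a flipped-kernel convolution.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def forward_layer(signals, kernels, connections=None):
    """Propagate node-group signals through one CNN layer.

    signals:     array (K, N, M), one N x M node group per input group k
    kernels:     array (n_out, K, k, k), one kernel per connection k => n
    connections: optional boolean array (n_out, K); False removes k => n
    Returns an array (n_out, N-k+1, M-k+1) of sigmoid outputs.
    """
    windows = sliding_window_view(signals, kernels.shape[-2:], axis=(1, 2))
    if connections is not None:
        kernels = kernels * connections[:, :, None, None]
    # Sum over connected groups k and kernel elements (u, v).
    net = np.einsum("kijuv,nkuv->nij", windows, kernels)
    return sigmoid(net)
```

Using a kernel as large as the node groups of the last hidden layer reduces each output group to a single node, which is how the fully connected output layer appears in this formulation.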


F.  Back-propagation  training 

The  error  back-propagation  learning  rule  is  used  for  su¬ 
pervised  training  of  the  CNN.  The  error  function  that  is  to  be 
minimized  by  training  is  given  by 

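$$\mathrm{Error} = \frac{1}{2}\sum_{i=1}^{N_{\mathrm{out}}}\left[S_{\mathrm{in}}(i) - S_{L_o}(i)\right]^2, \tag{3}$$

where $S_{\mathrm{in}}(i)$ is the input (or desired) value of a given
training case at the ith node of the output layer, $L_o$,
$S_{L_o}(i)$ is the network output signal of the case at that node, and
$N_{\mathrm{out}}$ is the number of nodes in the output layer.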

The  conventional  steepest  descent  delta  rule  for  back- 
propagation  training  of  a  CNN  can  be  written  as 

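$$W_L((u,v);k\Rightarrow n)[t+1] = W_L((u,v);k\Rightarrow n)[t] + \eta\sum_{i,j}\delta_L((i,j);n)\times S_{L-1}((i+u,\,j+v);k), \tag{4}$$

where t is the number of iterations, $\eta$ is the learning rate, and
$\delta_L$ is the weight-update function given by

$$\delta_L((i,j);n) = S_L((i,j);n)\,\left[1 - S_L((i,j);n)\right]\,G_L((i,j);n), \tag{5}$$

where

$$G_L((i,j);n) = \sum_{n'}\sum_{u,v} W_{L+1}((u,v);n\Rightarrow n')\times\delta_{L+1}((i-u,\,j-v);n'), \tag{6}$$

with the summation taken over the groups $n'$ in the (L+1)th layer that
are connected to group n and over the kernel elements $(u,v)$.

At the output layer, the weight is updated as

$$W_{L_o}((i,j);k\Rightarrow n)[t+1] = W_{L_o}((i,j);k\Rightarrow n)[t] + \eta\,\delta_{L_o}(n)\times S_{L_o-1}((i,j);k), \tag{7}$$

where

$$\delta_{L_o}(n) = \left[S_{\mathrm{in}}(n) - S_{L_o}(n)\right]\,S_{L_o}(n)\,\left[1 - S_{L_o}(n)\right].$$

Fig. 3. Dependence of classification accuracy, A_z, on the number of
iterations. The solid curves are the average obtained from four
repeated runs. The two dotted curves around each solid curve indicate
the average ± one SD of A_z estimated from the repeated runs. CNN
configuration: 16×16 input nodes; first hidden layer: 12 node groups;
second hidden layer: 12 node groups, each connected to 8 groups in the
first hidden layer; two output nodes; weight kernels between input and
first hidden layer: 5×5; weight kernels between first and second hidden
layers: 3×3.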

Training may be terminated at a selected level of total error, which is
the sum of the error for an individual case [Eq. (3)] over all cases in
the training set, a selected level of classification accuracy (A_z) as
defined below, or a preset number of iterations. In this study, we used
the total error as the termination criterion. The total error allowed
at termination was chosen to be low enough so that the test A_z could
reach a plateau, as demonstrated in Fig. 3.
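As an illustration of steepest-descent training with an error-based stopping rule, the sketch below applies the standard delta rule to a fully connected layer of sigmoid output nodes under the squared error of Eq. (3). It is a simplified stand-in, not the authors' implementation: the function names, the single training case, the learning rate, and the tolerance are assumptions, and hidden-layer kernels are not updated here.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def output_layer_step(weights, inputs, desired, eta):
    """One delta-rule update for sigmoid output nodes.

    weights: (n_out, n_in), inputs: (n_in,), desired: (n_out,)
    Returns the updated weights and the error before the update.
    """
    out = sigmoid(weights @ inputs)
    error = 0.5 * np.sum((desired - out) ** 2)
    # Negative error gradient with respect to each node's net input.
    delta = (desired - out) * out * (1.0 - out)
    return weights + eta * np.outer(delta, inputs), error


def train_until(weights, inputs, desired, eta=1.0, tol=0.05, max_iter=1000):
    """Iterate until the error falls below tol or max_iter is reached."""
    errors = []
    for _ in range(max_iter):
        weights, error = output_layer_step(weights, inputs, desired, eta)
        errors.append(error)
        if error < tol:
            break
    return weights, errors
```

With several training cases, the stopping test would compare the error summed over all cases against the tolerance, as the text describes.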

In our application, all weights in the CNN were initialized to be
between −0.5 and +0.5 using a random number generator with a different
seed in each training run and normalized by the number of weights in
the exponential factor of the sigmoidal activation function [Eq. (1)].
An N×N-pixel region centered at a potential site that passes the size
and contrast tests formed the input ROI to the CNN. For a given input
SNR threshold, the program would identify a number of potential
signals. A low SNR threshold corresponded to a lax criterion with a
large number of false-positive (FP) signals. A high SNR threshold
corresponded to a stringent criterion with a small number of FP signals
and a loss in true-positive (TP) signals. For training the CNN, we
arbitrarily
divided  the  38  mammograms  in  the  average  and  subtle 
groups into two subgroups. When the ROIs obtained from one subgroup
were used for training, the trained CNN would be applied to the second
subgroup for testing, and vice versa. We chose one of the SNR
thresholds that yielded a moderate number of FPs and a sufficiently
large number of TPs for segmenting the training ROIs. Because the
number of FPs was still a few times larger than the number of TPs, a
subset of FPs with approximately the same number as the TPs was
randomly chosen for the training set. It should be noted that the
chosen SNR threshold level was not critical as long as the numbers of
FPs and TPs were sufficiently large to provide the variety of ROI
patterns for training the CNN. The ROIs obtained by using a high SNR
threshold were generally a subset of those obtained by using a low SNR
threshold. A chosen ROI input to the CNN was obtained from the
SNR-enhanced image. The gray level values of the pixels in the ROI were
thus independent of the SNR threshold at which it was chosen, and all
ROIs had the same average background pixel value.

Table I. Physical characteristics of microcalcifications extracted with
an SNR threshold of 2.0 from the three groups of unfiltered
mammograms.

Image group    No. of   No. of   Mean no. of    Size (pixels)      Contrast (pixel value)   SNR
               images   μcalc.   μcalc./image   Mean   Std. dev.   Mean    Std. dev.        Mean   Std. dev.
Ratings 1,2    14       213      15             12.3   11.4        183.4   84.2             5.8    2.6
Rating 3       16       162      10             13.4   12.5        164.6   82.8             5.5    2.6
Ratings 4,5    22       270      12             9.0    9.4         143.9   87.0             4.6    2.2

The  shape  of  the  microcalcifications  in  the  breast  paren¬ 
chyma  could  be  considered  randomly  oriented  if  we  consid¬ 
ered  all  possible  locations  of  the  microcalcifications  in  the 
breast  and  all  mammographic  views.  To  increase  the  variabil¬ 
ity  of  the  training  group,  eight  input  ROIs  to  the  CNN  were 
generated  from  each  ROI  by  rotating  the  ROI  and  its  mirror 
image  to  0°,  90°,  180°,  and  270°.  Each  training  cycle  thus 
included  training  of  the  complete  set  of  training  ROIs  with 
the  eight  orientations.  The  input  order  of  the  training  ROIs 
was  randomized  with  a  different  random  number  sequence  in 
each  run.  A  test  ROI  would  be  rotated  also  in  the  eight  ori¬ 
entations,  and  the  average  output  value  of  the  eight  rotated 
ROIs  was  taken  to  be  the  output  value  of  that  test  ROI. 
During  training,  the  desired  output  of  an  ROI  with  microcal¬ 
cification  was  set  to  1  and  that  of  an  ROI  without  microcal¬ 
cification  was  set  to  0. 
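The eight-orientation scheme (four rotations of the ROI and of its mirror image) and the averaging of the eight outputs for a test ROI can be sketched as follows. The function names and the `classifier` callable are hypothetical placeholders for a trained CNN.

```python
import numpy as np


def eight_orientations(roi):
    """Return the ROI and its mirror image rotated by 0, 90, 180, 270 degrees."""
    roi = np.asarray(roi)
    mirrored = np.fliplr(roi)
    return [np.rot90(image, k) for image in (roi, mirrored) for k in range(4)]


def averaged_output(classifier, roi):
    """Average a classifier's output over the eight orientations of a test ROI."""
    return float(np.mean([classifier(view) for view in eight_orientations(roi)]))
```

For a square ROI the eight views are exactly the symmetries of the square, so no interpolation is needed and the pixel values are unchanged.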

Table II. Number of ROIs with microcalcifications and false positives
for training of the CNN.

Number of ROIs        Group 1 (G1)   G1 with rotation   Group 2 (G2)   G2 with rotation
Microcalcifications   110            880                108            864
False positives       116            928                116            928

We investigated the dependence of the classification accuracy of
positive and negative ROIs on the CNN configurations. Because of the
computational requirements in training the CNNs, we did not
exhaustively study every possible combination of parameters. The range
of parameters that we studied and the corresponding results are
tabulated in Table III. CNNs with one and two hidden layers were
examined.
The  number  of  node  groups  in  the  hidden  layers  was  varied 
from  4  to  12.  In  most  of  the  two-hidden-layer  CNNs,  the 
number  of  groups  was  kept  the  same  for  both  layers.  Com¬ 
binations  of  12  groups  in  the  first  hidden  layer  and  4,  8,  or  12 
groups  in  the  second  hidden  layer  were  also  studied.  All  node 
groups  in  the  two  hidden  layers  are  fully  connected  in  these 
configurations.  Additionally,  a  12  group- 12  group  combina¬ 
tion  in  which  every  3  of  the  12  groups  in  the  second  hidden 
layer  were  connected  to  the  same  8  selected  groups  in  the 
first  hidden  layer  was  examined.^"^  For  comparison,  a  CNN 
with  8  groups  in  the  first  hidden  layer  and  12  groups  in  the 
second  hidden  layer  with  full  connections  was  included.  We 
also  evaluated  the  classification  accuracy  for  two  combina¬ 
tions  of  weight  kernel  sizes;  one  had  a  kernel  size  of  5X5  in 
the  first  hidden  layer  and  3X3  in  the  second  hidden  layer  and 
the  other  had  a  kernel  size  of  7X7  in  the  first  hidden  layer 
and  5X5  in  the  second  hidden  layer.  We  did  not  investigate 
larger kernel sizes because the sizes of the microcalcifications of
interest were generally much smaller than 7×7 pixels and because
computation time increased rapidly with kernel size. The input ROI size
was adjusted so that the size of the node groups in the last hidden
layer was 10×10 for both combinations of kernel sizes. The output nodes
were always fully connected to every node group in the last hidden
layer with a 10×10 kernel, as shown in Fig. 1.
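The relation between input ROI size, kernel sizes, and node-group sizes can be checked with a few lines of arithmetic. The helper names below are hypothetical; the rule that each k×k kernel shrinks a group by k−1 pixels per side is the one implied by the sizes quoted in the text.

```python
def group_sizes(input_size, kernel_sizes):
    """Node-group side length after each layer, starting from the input ROI."""
    sizes = [input_size]
    for k in kernel_sizes:
        sizes.append(sizes[-1] - k + 1)
    return sizes


def required_input_size(last_size, kernel_sizes):
    """Input ROI side length that yields last_size after all kernels."""
    return last_size + sum(k - 1 for k in kernel_sizes)
```

For example, `group_sizes(16, [5, 3])` gives 16, 12, and 10, and `group_sizes(20, [7, 5])` gives 20, 14, and 10, so both kernel combinations end with 10×10 node groups.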

The  classification  accuracy  of  the  CNN  during  training 
was  monitored  by  receiver  operating  characteristic  (ROC) 
analysis^^  of  the  output  values  from  the  CNN.  After  each 
iteration,  or  epoch,  with  the  training  set  was  completed,  the 
classification  performance  with  the  current  weights  for  all 
training  cases  would  be  determined  by  inputting  the  training 
cases  into  the  CNN  as  a  consistency  verification  procedure. 
The distributions of the output values for the positive ROIs and the
negative ROIs would be input into the LABROC1 program, which assumes
binormal distributions of the decision variable for the normal and
abnormal cases and fits an ROC curve based on maximum likelihood
estimation. The ROC curve represents the relationship between the
true-positive fraction (TPF) and the false-positive fraction (FPF) as
the decision threshold varies. The LABROC1 program provides the area
under the fitted ROC curve, A_z, and an estimate of the SD of A_z. A_z
is used as an index of classification accuracy. The dependence of A_z
on the number of iterations was monitored during training. For every
ten training iterations, the trained CNN was applied to the other
independent set of ROIs to test its generalization capability. The
dependence of the test A_z on the number of iterations was also
examined.

Table III. Dependence of test results, in terms of the area under the
ROC curve (A_z), on the configuration of CNN for classification of
microcalcifications.

Input ROI size (pixels):             16×16    20×20
Kernel size (first hidden layer):    5×5      7×7
Kernel size (second hidden layer):   3×3      5×5

For each input ROI size, the A_z columns are, in order: one output node
(train G1, test G2); two output nodes (train G1, test G2); two output
nodes (train G2, test G1). The three columns for the 16×16 ROI precede
the three columns for the 20×20 ROI.

No. of groups in hidden layer
(first, second)      A_z values in column order (untested configurations have no entry)
4, 4                 0.86, 0.86, 0.86, 0.90, 0.87
6, 6                 0.88, 0.88, 0.86, 0.89, 0.90, 0.89
8, 8                 0.88, 0.88, 0.88, 0.89, 0.90, 0.89
10, 10               0.89, 0.89, 0.88, 0.91, 0.90, 0.89
12, 12               ...^a, 0.91, 0.89, 0.91, 0.89
12, 4                0.88, 0.86, 0.90, 0.90
12, 8                0.89, 0.90, 0.89, 0.90, 0.90, 0.90
8, 12                0.89, 0.89, 0.88, 0.90
12(8)^b, 12          0.90, 0.90, 0.90, 0.91, 0.90, 0.89

One hidden layer
4                    0.87, 0.85, 0.83, 0.85
8                    0.85, 0.87, 0.86, 0.85
12                   0.87, 0.86, 0.86, 0.86

^a The CNN configuration was not tested if there is no entry.

^b Eight node groups in the first hidden layer are selectively
connected to the 12 node groups in the second hidden layer. The A_z
values for this CNN are the averages of four runs shown in Table IV.
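For readers who want a quick check of an accuracy index like the A_z values in Table III, the sketch below computes the empirical (nonparametric) area under the ROC curve from two sets of classifier outputs. This is the Mann-Whitney estimate, a different technique from the binormal maximum-likelihood fit performed by LABROC1, so its values will generally differ slightly from a fitted A_z. The function name is hypothetical.

```python
def empirical_auc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))
```

The estimate equals the probability that a randomly chosen positive ROI receives a higher output value than a randomly chosen negative ROI.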


G.  Analysis  of  detection  accuracy 

After  passing  the  size  and  contrast  criteria,  being  screened  by 
the  trained  CNN,  and  passing  the  regional  clustering  crite¬ 
rion,  the  detected  individual  microcalcifications  and  clusters 
would  be  compared  with  the  “truth”  file  of  the  input  image. 
The  number  of  TP  and  FP  microcalcifications  and  the  num¬ 
ber  of  TP  and  FP  clusters  were  scored.  A  detected  signal  was 
scored  as  a  TP  microcalcification  if  it  was  within  0.5  mm 
from  a  true  microcalcification  in  the  “truth”  file.  A  detected 
cluster  was  scored  as  a  TP  if  its  centroid  coordinate  was 
within  a  cluster  radius  (5  mm)  from  the  centroid  of  a  true 
cluster and at least two of its member microcalcifications were scored
as TP. Once a true microcalcification or cluster was matched to a
detected microcalcification or cluster, it would be eliminated from
further matching. Any detected microcalcifications or clusters that did
not match a true microcalcification or cluster were scored as FPs. The
tradeoff between the TP and FP detection rates by the computer program
was evaluated by free-response receiver operating characteristic (FROC)
analysis by varying the input SNR threshold. A low SNR threshold
corresponded to a lax criterion with a large number of FP clusters. A
high SNR threshold corresponded to a stringent criterion with a small
number of FP clusters and a loss in TP clusters. The detection accuracy
of the computer program with and without the CNN classifier could then
be assessed by comparison of the FROC curves.
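The scoring of individual detections against the “truth” file can be sketched as below. The 0.5 mm tolerance and the rule that a matched true microcalcification is removed from further matching come from the text; the nearest-first greedy matching, the inclusive distance test, and the function name are assumptions.

```python
import math


def score_detections(detected, truth, max_distance_mm=0.5):
    """Count TP and FP detections; each true location can be matched once.

    detected, truth: lists of (x, y) coordinates in mm.
    Returns (true_positives, false_positives, missed_truth).
    """
    unmatched = list(truth)
    tp = fp = 0
    for d in detected:
        best = None
        for t in unmatched:
            dist = math.dist(d, t)
            if dist <= max_distance_mm and (best is None or dist < best[0]):
                best = (dist, t)
        if best is None:
            fp += 1
        else:
            tp += 1
            unmatched.remove(best[1])  # eliminate from further matching
    return tp, fp, len(unmatched)
```

Repeating this scoring at each input SNR threshold yields the pairs of TP and FP rates from which an FROC curve is drawn.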

III.  RESULTS 

Using the signal extraction program described in Sec. II, the size,
contrast, and SNR of the true microcalcifications as indicated in the
“truth” file for each of the three groups of mammograms were determined
at several SNR thresholds.
We  examined  the  extracted  signals  in  the  thresholded  images 
and  compared  visually  the  extracted  signals  with  those  in  the 
original  images.  When  the  SNR  threshold  was  too  low,  the 
signals  merged  with  one  another  or  with  noise  in  the  back¬ 
ground.  The  extracted  signals  did  not  represent  the  true  sig¬ 
nal  size  or  shape.  When  the  SNR  threshold  was  too  high, 
many  subtle  microcalcifications  were  not  extracted.  The  ex¬ 
tracted  signals  appeared  to  be  smaller  than  those  in  the  origi¬ 
nal  images  because  only  a  few  pixels  in  a  microcalcification 
were  higher  than  the  threshold.  It  was  determined  subjec¬ 
tively  that  an  SNR  threshold  of  2.0  was  a  compromise  with 
which  the  extracted  signals  were  similar  in  size  and  shape  to 
those  in  the  original  images.  At  this  SNR  threshold,  an  aver¬ 
age  of  about  85%  of  the  microcalcifications  were  extracted. 
The  other  15%  of  the  microcalcifications  could  not  be  ex¬ 
tracted  at  this  threshold  because  their  pixel  values  were 
lower  than  the  local  gray  level  threshold. 

Table I shows the mean and SD of the contrast, size, and SNR of the
microcalcifications extracted at an SNR threshold of 2.0 for each of
the three groups. Note that the “size” of an extracted
microcalcification depends on the SNR threshold used because it may
merge with adjacent noise or signal pixels, as discussed previously.
The contrast is relatively independent of the SNR threshold, since it
depends only on the maximum pixel value in the signal region. We have
plotted the histograms of the contrast, size, and SNR of the extracted
microcalcifications and found a large overlap in the physical
characteristics of the microcalcifications in the three groups of
mammograms. As can be seen in Table I, the visual ranking generally
correlates with the mean contrast and mean SNR of the
microcalcifications. However, the mean number of microcalcifications in
the subtle group is larger than that of the average group. These
observations indicate that the visibility of a microcalcification
cluster is more strongly affected by the contrast and SNR than by the
number of microcalcifications in the cluster. This is consistent with
the experience of radiologists in visual detection of
microcalcifications. The data in Table I should provide more objective
information than the visibility ratings in the description of the
degree of subtlety for each group of microcalcifications. The
quantitative characterization can facilitate comparison of the
performance of CAD algorithms in different datasets if a similar signal
extraction method and criteria are used in calculation of the data.

Table II shows the number of ROIs with true microcalcifications and
false signals used for training of the CNN. Each group of ROIs was
detected with the automated algorithm at an SNR threshold of 3.4 from
19 SNR-enhanced images. At this threshold, the average TP rate was 94%
at an average FP rate of 7.5 clusters per image. This point was outside
the range of the FP rates plotted in Fig. 7. There were no overlapping
cases in the two groups. The extracted ROIs of 16×16 pixels are
displayed in Figs. 2(a)-2(d). It can be seen that a large number of the
FPs extracted by the CAD program were caused by high-frequency
structures such as fibrous strands, film artifacts, and noise. Only
one-fifth of the FP ROIs were included in the training groups in order
to match approximately the number of ROIs with true
microcalcifications. With the rotation method, over 800 positive and
over 900 negative ROIs were generated in each training set. When one
group was used for training, the displayed ROIs in the other group,
together with the other four-fifths of the FP ROIs from the same set of
images, were used as the test set. The signal of interest was centered
in the ROI. The average background gray level was the same for all ROIs
after the SNR-enhancement filtering.

The dependence of the classification accuracy on CNN configuration and
training set is shown in Table III. The SDs of the A_z as determined by
the LABROC1 program ranged from 0.01 to 0.02. The classification
accuracy during training generally reached an A_z of 0.99 or greater
under all conditions studied. The test results exhibited some
variations, as can be seen from Table III. The CNNs with one hidden
layer are inferior to the CNNs with two hidden layers. The performance
of the CNNs with two hidden layers does not depend strongly on the
configuration when the total number of weights in the CNN is large.
There is a slight trend, with some minor variations, that the A_z
increases as the number of node groups increases. This trend is more
systematic for the CNNs with small weight kernels. There is also a
trend that, for the same CNN configuration, the test A_z is larger when
G1 is used for training than when G2 is used. The difference in the
test A_z values between the two training/test group combinations,
averaged over all two-hidden-layer, two-output-node CNNs studied, is
only about 0.01. This difference, however, is statistically significant
at a two-tailed p level of 0.005.

We also compared the difference in performance between CNNs with one
and two output nodes. The two output values from the two-output CNNs
were found to be complementary to each other, i.e., for a given case,
if the output of one node was x, the output from the other node was
very close to (1−x). This outcome is expected because the desired
output values for the true and false signals were set to be 1 and 0,
respectively, as described above. Therefore, the output from one node
was sufficient for the classification task, and the ROC curve could be
constructed from either of the output nodes. The difference in the A_z
values between the one- and two-output configurations, averaged over
the two-hidden-layer CNNs, is 0.001. The difference is not
statistically significant (p=0.68).

Table IV. Reproducibility of test results for two CNNs. The A_z shown
is the maximum value reached for a given run.

Input ROI size (pixels)               16×16                   20×20
Kernel size, first hidden layer       5×5                     7×7
Kernel size, second hidden layer      3×3                     5×5
No. of groups, first hidden layer     12(8)                   8
No. of groups, second hidden layer    12                      8

                 Train: G1   Train: G2       Train: G1   Train: G2
Repeated run     Test: G2    Test: G1        Test: G2    Test: G1
                 A_z         A_z             A_z         A_z
1                0.91        0.91            0.91        0.90
2                0.90        0.88            0.90        0.88
3                0.89        0.89            0.90        0.89
4                0.90        0.91            0.90        0.88
Mean             0.90        0.90            0.90        0.89
Std. dev.        0.01        0.01            0.01        0.01

To study the variability in the classification accuracy due
to the initialization condition of the weights and training, the
training and testing of two selected CNN configurations were
repeated four times for each of the two training/test group
combinations. The CNN configurations and the results are
listed in Table IV. The SD of A_z is estimated from the
repeated runs to be 0.01 in each case, and the maximum
difference in A_z for the four runs is 0.03. For a given CNN
configuration, the mean A_z values for the two training/test
combinations agree within 0.01. Figure 3 shows the dependence
of A_z for training and testing, averaged over four runs,
on the number of iterations for one of the CNNs. The SDs of
A_z estimated from the repeated runs are also plotted for the
training and the test curves. Both the mean A_z values and
SDs stabilize after some large fluctuations in the initial
iterations. The shapes of the A_z curves are typical of the
conditions included in this study, although the rate of convergence
varies with CNN configurations. The curves increase rapidly
initially, then plateau and gradually approach their maximum
level. The convergence of the CNN training can also be
observed from the dependence of the total error on the number
of iterations, as shown in Fig. 4. The magnitude of the
error depends on the number of output nodes and the number
of input cases. However, the trend of the curve is typical
among the CNNs studied. It shows a steep descent initially,
then gradually levels off at a large number of iterations.

Medical Physics, Vol. 22, No. 10, October 1995

Chan et al.: Recognition of microcalcifications with a neural network

Fig. 4. Dependence of total error of the CNN output on the number of
iterations. The CNN configuration is the same as that in Fig. 3. Training
group G2 was used.
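The run-to-run variability quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the A_z values are transcribed from Table IV, the column labels are informal names chosen here, and the use of the sample (n-1) standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev

# A_z values transcribed from Table IV (four repeated runs per column).
# The dictionary keys are informal labels, not the paper's notation.
runs = {
    "12(8)/12, train G1, test G2": [0.91, 0.90, 0.89, 0.90],
    "12(8)/12, train G2, test G1": [0.91, 0.88, 0.89, 0.91],
    "8/8, train G1, test G2": [0.91, 0.90, 0.90, 0.90],
    "8/8, train G2, test G1": [0.90, 0.88, 0.89, 0.88],
}


def summarize(values):
    """Return (mean, sample SD, max - min) for one column of repeated runs."""
    return mean(values), stdev(values), max(values) - min(values)


summary = {name: summarize(values) for name, values in runs.items()}

for name, (m, sd, spread) in summary.items():
    print(f"{name}: mean={m:.2f} SD={sd:.3f} range={spread:.2f}")
```

With these inputs the standard deviations come out on the order of 0.01 and the largest spread within a column is 0.03, in line with the figures stated in the text.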

The training of each CNN with each group of training
cases produces a set of weights at each iteration. Many of the
CNN configurations reach approximately the same level of
performance (Table III) and may be used as a classifier in the
microcalcification detection program. We selected one of the
trained CNNs shown in Table IV (2 hidden layers, each with
12 node groups, every 3 node groups in the second hidden
layer selectively connected to 8 node groups in the first hidden
layer, weight kernel sizes of 5×5 and 3×3) to demonstrate
the effect of the CNN classifier on detection accuracy.
A weight set trained with the G1 group was used to test the
classification accuracy for the G2 group and another set
trained with the G2 group was used to test the classification
accuracy for the G1 group. The weights were obtained from
one of the iterations when the plateau of A_z was reached. The
ROC curves for classification of the test groups of ROIs
using the trained CNNs are shown in Fig. 5. The A_z values of
the curves are 0.91 and 0.90, which correspond to the best
performance obtained with the CNNs tabulated in Table III.

Fig. 5. The ROC curves obtained with the test ROI groups. The CNN
configuration is the same as that in Fig. 3.

We incorporated the trained CNN into our microcalcification
detection program as described previously, and the overall
improvement in the detection accuracy was evaluated. For
any SNR threshold, each extracted signal that passed the size
and contrast criteria was input into the CNN. The set of
weights obtained from training with G1 was used for the 19
mammograms from which the G2 ROIs were extracted, and
vice versa. For the group of obvious mammograms, either set
of weights could be used because the obvious cases were not
used for training or testing. The performance of the trained
CNNs on the obvious cases was thus an additional independent
test for the classifiers. In this application, a constant
decision threshold was set for the CNN output value of any
input ROI to determine if the ROI was normal or abnormal.
To select the appropriate decision threshold for the output
value from the CNN, the dependence of the FROC curve on
the decision threshold was evaluated. This corresponded to
varying the operating point along the ROC curve (Fig. 5) of
the classifier. The FROC curves for the three sets of mammograms
were plotted in Figs. 6(a)-6(c). The data points
along each FROC curve were obtained by varying the SNR
thresholds from 3.0 to 5.2. Some of the data points were not
plotted if they were outside the range of the graph. A curve
without the CNN (decision threshold=0), and two curves
with CNN at decision thresholds of 0.5 and 0.8, respectively,
were plotted. For a given TP rate, the number of FP clusters
decreased as the CNN threshold increased from 0.1 to 0.8.
When the CNN threshold was further increased to 0.9, we
observed a decrease in the TP rate for a given FP rate for
subtle cases, indicating that many of the ROIs with subtle
microcalcifications were misclassified with the high CNN
threshold. At a CNN threshold of 0.8, the TP rate was 100%
at an FP rate of less than 0.1 cluster per image for the obvious
cases. For the cases that were ranked average subtle by
radiologists, the TP rate was about 93% at an FP rate of one
cluster per image. For the subtle cases, the TP rate was 87%
at an FP rate of 1.5 clusters per image.
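The role of the constant decision threshold can be illustrated with a minimal sketch. The function name, the (identifier, output) tuple layout, and the use of a greater-than-or-equal comparison are assumptions made here for illustration; the text states only that a constant threshold on the CNN output separates normal from abnormal ROIs and that a threshold of 0 is equivalent to running without the CNN.

```python
def keep_signals(signals, cnn_threshold):
    """Keep candidate signals whose CNN output reaches the decision threshold.

    signals: list of (signal_id, cnn_output) pairs, cnn_output in [0, 1].
    A threshold of 0 keeps every candidate, which mimics the
    'without CNN' curve described in the text.
    """
    return [signal_id for signal_id, output in signals if output >= cnn_threshold]


# Hypothetical candidates that already passed the size and contrast criteria.
candidates = [("a", 0.95), ("b", 0.60), ("c", 0.10)]
for threshold in (0.0, 0.5, 0.8):
    print(threshold, keep_signals(candidates, threshold))
```

Raising the threshold removes more candidates, which is why both the FP count and, eventually, the TP rate fall as the threshold approaches 0.9.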

The degree of subtlety of the clustered microcalcifications
in the cases ranked from 3 to 5 is similar to that of the cases
used in our previous observer performance study. The likelihood
that these microcalcifications may be missed is not
negligible, and thus it is of particular interest for CAD
applications. The average improvement in the detection accuracy
for these cases is estimated by comparison of the FROC
curves without and with the CNN classifier for all cases
ranked 3-5. The FROC curves are shown in Fig. 7. The TP
rate improves from about 87% at an FP rate of 4 clusters per
image without the CNN classifier to 90% at an FP rate of
about 1.5 clusters per image, with the CNN classifier at a
decision threshold of 0.8.

IV.  DISCUSSION 

The computational cost for training a CNN is high. The
computational cost per iteration increases as the numbers of
nodes and weights increase. However, it was observed that
the rate of convergence increased as the number of nodes
increased. For example, for CNNs of the same configuration,
except for a difference in the number of output nodes, the
two-output-node CNN reached the maximum A_z with a
smaller number of iterations than the corresponding
one-output-node CNN. This trend is more obvious for the CNNs
with fewer node groups in the hidden layers. Similarly, the
convergence rate increases until the number of node groups
in the hidden layers increases to about 10 for CNNs with
two-output nodes.

Fig. 6. Comparison of FROC curves for detection of clustered
microcalcifications without and with the CNN classifier. The curve without CNN is
equivalent to that with the decision threshold of the CNN set to 0. The
FROC curves with the decision threshold of the CNN set to 0.5 and 0.8 are
plotted for comparison. (a) Mammograms with obvious microcalcifications.
(b) Mammograms with average subtle microcalcifications. (c) Mammograms
with subtle microcalcifications. The CNN configuration is the same as
that in Fig. 3.

Fig. 7. Comparison of FROC curves for detection of clustered
microcalcifications without and with the CNN classifier. The overall detection
accuracy for the average and subtle groups of microcalcifications is compared.
The CNN configuration is the same as that in Fig. 3.

The convergence rate saturates sooner for the CNNs with
a larger kernel size. Therefore, the overall training cost of
CNNs with complicated configurations may not be higher
than that of CNNs with simpler configurations. We could not
perform an exact comparison of the computation time for different
CNN configurations because we had to make use of all available
workstations, which had different CPU speeds and different
memory capacities, to train the CNNs. It may be noted
that the computational cost with a complicated CNN configuration
is higher than that of a simple one when it is
incorporated in the microcalcification detection program for
classification of test cases.

We have attempted to apply the FROCFIT curve fitting
program to the FROC curves in this study, but failed to
obtain well-fitted curves. This may be caused by the fact that
FROCFIT was developed on the basis of several assumptions,
which may not be satisfied for our detection task. We
therefore could not arrive at a single value such as the A
as the performance index for comparison of the different
conditions. The generalization capability of the CNN can be
observed from the effectiveness of the trained CNN in reducing
FPs in the additional independent test group of mammograms
of ratings 1 and 2 [Fig. 6(a)]. At a CNN threshold of
0.8, the FP clusters were reduced to zero for almost all TP
rates below 100%. Because there is no established method to
test the statistical significance of the difference in two FROC
curves, we performed t tests on the image-specific paired FP
values between the without-CNN and with-CNN (threshold=0.8)
results at corresponding TP rates, in an effort to estimate
the significance of their differences. The p values
ranged from 0.04 to 0.08. Although the improvement in the
FP rates was very consistent over the entire range of TP
rates, as shown in Fig. 6(a), the level of significance for
individual TP rates was not high, probably because the FP
rates without CNN were already very low.

The performance of the trained CNN can also be observed
from the effective reduction of FPs at different SNR thresholds
in the test group of mammograms of ratings 3-5. As
shown in Fig. 7, for a given TP rate, the CNN reduced the FP
clusters by more than 70% with a CNN threshold of 0.8. We
again performed t tests on the image-specific paired FP values
at corresponding TP rates. For TP rates between about
20% and 75%, the p values of the differences between the FP
rates without CNN and with CNN (threshold=0.8) ranged
from 0.06 to 0.0002. We also performed t tests on the paired
TP rates at corresponding FP rates. For FP rates between
about 0.1 and about 0.7 clusters per image, all p values of the
differences between the TP rates without CNN and with
CNN (threshold=0.8) were less than 0.001.
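For readers unfamiliar with the paired t test applied to image-specific FP values, the following minimal sketch computes the test statistic from two matched lists. The function name and the per-image FP counts are hypothetical and chosen only for illustration; they are not data from this study. Converting the statistic to a p value requires the Student t distribution, which is deliberately left out of this standard-library-only sketch.

```python
import math
from statistics import mean, stdev


def paired_t(without_cnn, with_cnn):
    """Return (t statistic, degrees of freedom) for paired observations."""
    if len(without_cnn) != len(with_cnn):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(without_cnn, with_cnn)]
    n = len(diffs)
    # Standard error of the mean difference, using the sample (n-1) SD.
    standard_error = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / standard_error, n - 1


# Hypothetical FP counts for the same five images at one TP rate.
fp_without = [4, 3, 5, 2, 6]
fp_with = [1, 1, 2, 0, 2]
t_value, dof = paired_t(fp_without, fp_with)
print(f"t = {t_value:.3f} with {dof} degrees of freedom")
```

Pairing by image removes the image-to-image variation in FP counts, so the test responds only to the consistency of the per-image reduction.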

The FROC curves presented here were obtained by varying
the SNR threshold in the local gray level thresholding
process. The CNN classifier was implemented so that a constant
decision threshold for its output value was used to classify
ROIs with and without microcalcifications obtained at
any SNR threshold. Alternatively, we can select a relatively
low SNR threshold that produces a large number of FPs and
vary the decision threshold for the output of the CNN classifier,
thereby generating pairs of TP and corresponding FP
values along an FROC curve. We have studied this approach
by using SNR thresholds from 3.0 to 5.2, from each of which
an FROC curve was generated by varying the CNN threshold
from 0.1 to 0.9. It was observed that the FROC curves obtained
with this alternative method were lower than the
FROC curve with CNN (threshold=0.8) plotted in Fig. 7. On
each of these alternative FROC curves, the data point at a
CNN threshold of 0.8 coincided with the data point on the
FROC curve with CNN (threshold=0.8) shown in Fig. 7,
because they are the data points with the same SNR and
CNN thresholds. Other data points on the alternative FROC
curves are either comparable to or lower than the FROC
curve in Fig. 7, with a few exceptions in the range of very
low TP and FP rates.

The goal of this study is to evaluate the feasibility of
training a CNN to distinguish FP signals from true microcalcifications
obtained from our automated detection program.
Although a small dataset was used and sample biases may
exist, the effectiveness of the method as one of the steps in
the classification process was demonstrated by the relative
improvement in the detection accuracy. In the field of CAD,
it is known that different detection algorithms or even different
human observers may generate FPs of different characteristics.
Before the CNN classifier is incorporated into
a CAD program for clinical implementation, it is important
to train the classifier using true and false microcalcifications
obtained from the specific application. The training dataset
should also be large enough to ensure that the patient population
is adequately represented and that the performance of
the classifier can be generalized.

V. CONCLUSION

We have developed a computer program for automated
detection of clustered microcalcifications on mammograms
for CAD applications. In this study, we investigated the
effectiveness of a new signal classifier based on artificial neural
network methodology for improvement of the detection
accuracy of the CAD program. The CNN classifier was
trained to recognize individual microcalcifications and incorporated
as one of the signal classification steps. It was found
that the CNN classifier can achieve a classification accuracy,
expressed in terms of the A_z index with ROC analysis, of
0.9. It reduced the average FP rates by more than 70% at all
TP rates on mammograms with subtle to obvious microcalcifications.
Although the number of cases used in this study
is limited, the improvement is consistent and statistically
significant. This study demonstrates that a CNN can be trained
to recognize mammographic microcalcifications and is effective
in reducing FP detections in CAD applications.

masses. Of the malignant masses, 45 had spiculated margins.
Of the benign masses, 6 were spiculated. The visibility
of the masses ranged from subtle to obvious. The
average size (length of the long axis) of the masses, as estimated
by the radiologists, was 12.2 mm, and the standard
deviation of the mass size was 4.5 mm. The mammograms
were randomly divided into training and test groups, each
of which contained 84 mammograms.

The mammograms were digitized with a LUMISYS DIS-1000
laser scanner at a pixel size of 100 μm x 100 μm and
4096 gray levels. The light transmitted through the film
was amplified logarithmically before analog-to-digital conversion.
The digitizer had an optical density (OD) range
of 0-3.5. It was calibrated so that the OD on film was linearly
proportional to the output pixel value in the range
of about 0.1 OD to 2.8 OD with a slope of 0.001 OD/pixel
value. The slope of the calibration curve outside this range
decreased gradually.

Four different ROIs, each with 256 x 256 pixels, were selected
from each mammogram by a radiologist experienced
in mammography. One of the selected ROIs contained the
true mass, which was identified by an experienced radiologist
and verified by biopsy reports. The remaining three
ROIs contained breast parenchyma that was presumed to
be normal, with the first region containing dense tissue
which could mimic a mass lesion, the second region containing
mixed dense/fatty tissue, and the third region containing
fatty tissue. An example of each of these ROIs is
shown in Fig. 2.

B.  Background  Correction 

The masses superimpose on structured background tissue
in the ROIs. In most cases, this background tissue is not
uniform over the ROI. For example, one side of the ROI
may contain denser tissue than the other side, or, when the
mass is close to the outer edge of the breast, one corner of
the ROI may contain a non-breast region. This may reduce
the detectability of the mass by a neural network. To
reduce this non-uniformity, we developed a background correction
method that estimated the background level based
on the image intensity in a band of pixels surrounding the
ROI.

We estimated a background image from the original
ROI as follows. For a given point (i_0, j_0) in the original
image, we computed four averages, L, R, U, and D, inside
the boxes A_L, A_R, A_U, and A_D shown in Fig. 3. The
centers of the boxes were either at the same row or at the
same column as (i_0, j_0). The box size was 16 x 32 if it
could be placed entirely inside the ROI. Near the corners
of the image, the box size was gradually decreased in order
to avoid edge effects. For example, if the center of the box
was exactly at a corner, then the box size was 16 x 16.
The background pixel B(i_0, j_0) was interpolated from the
averages L, R, U, and D as

\[
B(i_0, j_0) = \left( \frac{L}{d_l} + \frac{R}{d_r} + \frac{U}{d_u} + \frac{D}{d_d} \right) \Big/ \left( \frac{1}{d_l} + \frac{1}{d_r} + \frac{1}{d_u} + \frac{1}{d_d} \right), \qquad (11)
\]

where d_l, d_r, d_u, and d_d are the distances between (i_0, j_0)
and each side of the image. The background image was
then subtracted from the original image, thus reducing the
background to near 0.
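The interpolation step can be sketched in a few lines. This is a minimal illustration that assumes Eq. (11) is an inverse-distance-weighted average of the four box means, so that the nearest side of the ROI contributes most; the function and argument names are chosen here and do not come from the paper, and the computation of the box averages themselves is omitted.

```python
def interpolate_background(L, R, U, D, d_l, d_r, d_u, d_d):
    """Inverse-distance-weighted average of the four border-box means.

    L, R, U, D: mean gray levels in the left, right, upper and lower boxes.
    d_l, d_r, d_u, d_d: distances (> 0) from the pixel to the matching sides.
    """
    values = (L, R, U, D)
    weights = (1.0 / d_l, 1.0 / d_r, 1.0 / d_u, 1.0 / d_d)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# A pixel at equal distance from all four sides gets the plain average.
print(interpolate_background(100, 120, 80, 100, 128, 128, 128, 128))
# A pixel hugging the left side is dominated by the left-box average.
print(interpolate_background(100, 120, 80, 100, 1, 255, 128, 128))
```

Subtracting this estimate from each pixel leaves a uniform background unchanged in shape but shifted to about zero, which is the behavior the text describes.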

As mentioned in the previous subsection, the average size
of the masses was 12.2 mm, which corresponded to 122 pixels
after digitization. Since the masses were placed in the
center of the ROIs in the extraction process, very few of the
ROIs contained mass tissue in the 16-pixel wide band that
was used for background estimation. Out of 168 masses
in our database, only four had a long axis longer than 220
pixels, and the long axis in these four cases was not exactly
horizontal or vertical. We therefore believe that our estimation
essentially excluded information about the mass itself,
and included only information about the background. Fig.
4 shows an example of an ROI before and after background
correction.

C.  Classification  with  Subsampled  Images 

The  simplest  method  of  classifying  mass  and  nonmass  ROIs 
using  a  CNN  would  be  to  input  the  background-corrected 
images  directly  to  the  input  layer  of  the  CNN.  However, 
the  computational  cost  of  inputting  256  x  256  ROIs  into 
a  CNN  was  prohibitive.  We  thus  had  to  reduce  the  image 
size  by  averaging  adjacent  pixels  and  subsampling.  We 
investigated  the  effect  of  reducing  the  image  size  to  16  x  16 
and  32  x  32.  Averaging  was  performed  on  non-overlapping 
boxes  of  size  16  x  16  to  obtain  16  x  16  subsampled  images, 
and  of  size  8  x  8  to  obtain  32  x  32  subsampled  images. 
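The averaging and subsampling step amounts to replacing each non-overlapping box by its mean. The sketch below is a plain-Python illustration (nested lists rather than an array library); the function name is chosen here for illustration.

```python
def average_subsample(image, box):
    """Average non-overlapping box x box blocks of a 2-D list of numbers."""
    rows, cols = len(image), len(image[0])
    if rows % box or cols % box:
        raise ValueError("image size must be a multiple of the box size")
    reduced = []
    for r in range(0, rows, box):
        reduced_row = []
        for c in range(0, cols, box):
            total = sum(image[rr][cc]
                        for rr in range(r, r + box)
                        for cc in range(c, c + box))
            reduced_row.append(total / (box * box))
        reduced.append(reduced_row)
    return reduced


# A 256 x 256 ROI averaged over 16 x 16 boxes gives a 16 x 16 image,
# and over 8 x 8 boxes gives a 32 x 32 image.
roi = [[(r + c) % 7 for c in range(256)] for r in range(256)]
print(len(average_subsample(roi, 16)), len(average_subsample(roi, 8)))
```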

We investigated the use of a three-layer CNN with a single
input image, and a single output node, as shown in
Fig. 5. The number of hidden layer groups, N(2), and the
weight kernel size between the input layer and the hidden
layer, S_w(1), were variable. For this special case, the CNN
forward propagation equations simplified considerably. Let
F(i,j) = H_{1,1}(i,j) denote the subsampled input image,
W_g(i,j) denote the weight kernel between the
input layer and the g-th group in the hidden layer, and let
W'_g(i,j) denote the weight kernel between
the g-th group in the hidden layer and the output O. Then,
the forward propagation equations (1)-(3) simplified to

\[
I_{2,g}(i,j) = F(i,j) \ast\ast\, W_g(i,j), \quad g = 1, \ldots, N(2), \qquad (12)
\]

\[
H_{2,g}(i,j) = \frac{1}{1 + \exp\left(-I_{2,g}(i,j)\right)}, \quad g = 1, \ldots, N(2), \qquad (13)
\]

and

\[
O = \frac{1}{1 + \exp\left(-\sum_{g=1}^{N(2)} H_{2,g} \ast\ast\, W'_g\right)}. \qquad (14)
\]
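A forward pass through a network of this shape (one input image, one hidden layer of kernel groups, one sigmoid output) can be sketched as follows. Several details are assumptions made for illustration and are not confirmed by the text: the kernels are applied without flipping and only where they fit entirely inside the image, there are no bias terms, and each output kernel has the same size as its hidden-layer image so that it reduces to a single number. All function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def correlate_valid(image, kernel):
    """Slide a square kernel over the image without padding or flipping."""
    k = len(kernel)
    n_rows = len(image) - k + 1
    n_cols = len(image[0]) - k + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(k) for b in range(k))
             for j in range(n_cols)]
            for i in range(n_rows)]


def forward(F, hidden_kernels, output_kernels):
    """Input image -> hidden groups (sigmoid) -> single sigmoid output."""
    total = 0.0
    for W_g, W_out in zip(hidden_kernels, output_kernels):
        I_g = correlate_valid(F, W_g)                      # net input to group g
        H_g = [[sigmoid(v) for v in row] for row in I_g]   # group g activations
        # Output kernel assumed to match H_g in size: full weighted sum.
        total += sum(h * w
                     for h_row, w_row in zip(H_g, W_out)
                     for h, w in zip(h_row, w_row))
    return sigmoid(total)


# Toy example: 4 x 4 input, one hidden group with a 3 x 3 kernel.
F = [[0.0] * 4 for _ in range(4)]
W = [[1.0] * 3 for _ in range(3)]
W_out = [[1.0, 1.0], [1.0, 1.0]]
print(forward(F, [W], [W_out]))
```

With an all-zero input every hidden activation is 0.5, so the output depends only on the sum of the output weights; this makes the toy case easy to check by hand.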

As mentioned in Section II.D., eight rotated and mirrored
images were applied to the CNN consecutively for each
averaged-subsampled ROI. The CNN output score for the
ROI was obtained as the average of the CNN outputs for
these eight images. The CNN output error for training
image p was calculated as the square of the difference of
the CNN output score and the desired CNN output.
Backpropagation with the delta-bar-delta rule was implemented
using equations (5)-(8).
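The averaging over eight orientations can be sketched as below. The sketch assumes the eight images are the four 90-degree rotations of the ROI, each with and without a left-right mirror; Section II.D is not reproduced here, so that reading is an assumption. `score_fn` stands in for the trained CNN, and all names are hypothetical.

```python
def rotate90(image):
    """Rotate a 2-D list clockwise by 90 degrees."""
    return [list(row) for row in zip(*image[::-1])]


def mirror(image):
    """Flip a 2-D list left to right."""
    return [row[::-1] for row in image]


def eight_views(image):
    """Four rotations, each with and without mirroring."""
    views = []
    current = [list(row) for row in image]
    for _ in range(4):
        views.append(current)
        views.append(mirror(current))
        current = rotate90(current)
    return views


def averaged_score(image, score_fn):
    """Average a scoring function over the eight views of one ROI."""
    views = eight_views(image)
    return sum(score_fn(view) for view in views) / len(views)


def squared_error(score, desired):
    return (score - desired) ** 2


roi = [[1, 2], [3, 4]]
score = averaged_score(roi, lambda view: view[0][0])  # toy stand-in for the CNN
print(score, squared_error(score, 1.0))
```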

After training, the averaged-subsampled images belonging
to the test group were applied to the CNN with the trained
weights, and the CNN test output scores were obtained using
forward propagation. The CNN output scores were
used as the decision variable in Receiver Operating Characteristics
(ROC) analysis [24] to evaluate the classification
performance. ROC analysis evaluates the relationship
between the true-positive fraction (TPF) and the false-positive
fraction (FPF) as the decision threshold varies.
We estimated the ROC curve using the LABROC1 program
[25], which assumes binormal distributions of the decision
variable for the normal and abnormal cases and fits
the ROC curve based on maximum likelihood estimation.
The area under the ROC curve, A_z, was used as an index of
classification accuracy. The classification results obtained
with the CNN described in this subsection are presented in
Section IV.A.
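To make the TPF/FPF relationship concrete, the sketch below traces an empirical ROC curve and computes a nonparametric area estimate. This is deliberately not the LABROC1 method: LABROC1 fits a binormal model by maximum likelihood, whereas the code here uses the simpler rank-based (Mann-Whitney) estimate, so its value will generally differ from a fitted A_z. The function names and scores are hypothetical.

```python
def roc_points(positive_scores, negative_scores):
    """Empirical (FPF, TPF) pairs as the decision threshold is lowered."""
    thresholds = sorted(set(positive_scores) | set(negative_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpf = sum(s >= t for s in positive_scores) / len(positive_scores)
        fpf = sum(s >= t for s in negative_scores) / len(negative_scores)
        points.append((fpf, tpf))
    return points


def empirical_auc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical CNN output scores for mass and non-mass ROIs.
mass_scores = [0.9, 0.8, 0.4]
normal_scores = [0.7, 0.3, 0.2]
print(roc_points(mass_scores, normal_scores))
print(empirical_auc(mass_scores, normal_scores))
```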

D. Classification with GLDS Texture-Images
D.1 GLDS Features

GLDS features, extracted from the GLDS vector of an image,
roughly measure the coarseness of the texture elements
in an image. The GLDS vector is the histogram of the absolute
value of the difference of pixel pairs separated by a
distance d_1 in the horizontal direction and d_2 in the vertical
direction. The vector d = (d_1, d_2) is called the displacement
vector. As discussed below, the distribution of
the elements of the GLDS vector p_d(k) indicates the size of the
texture element in the image relative to the displacement
vector d. GLDS features are extracted by computing some
measure of the distribution of the elements of the GLDS
vector.

To compute the GLDS vector p_d(k) for a given mammographic
ROI H(i,j), and a given displacement vector
d = (d_1, d_2), first a difference image is computed as
H_d(i,j) = |H(i,j) - H(i + d_1, j + d_2)|. The k-th entry of
the vector p_d is defined as the probability of occurrence of
the pixel value k in the difference image H_d(i,j).

If the image texture is coarse, and the length of the displacement
vector d is small compared to the texture element
size, then the pixels separated by d will usually
have similar pixel values. This implies that the elements of
the GLDS vector will be concentrated around 0, i.e., p_d(k) will
be large for small values of k, and small for large values of
k. Conversely, if the length of the vector d is comparable
to the texture element size, then the elements of the GLDS
vector will be distributed more evenly.
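The construction of the GLDS vector can be sketched directly from the description above. The function name and the handling of the image border (pixel pairs that fall outside the image are simply skipped) are assumptions made for illustration; `levels` is the number of possible gray-level differences.

```python
def glds_vector(image, d1, d2, levels):
    """Histogram of |H(i,j) - H(i+d1, j+d2)|, normalized to probabilities.

    image: 2-D list of integer gray levels in [0, levels - 1].
    Pairs whose second pixel falls outside the image are skipped
    (a border-handling choice made for this sketch).
    """
    rows, cols = len(image), len(image[0])
    counts = [0] * levels
    n_pairs = 0
    for i in range(rows):
        for j in range(cols):
            i2, j2 = i + d1, j + d2
            if 0 <= i2 < rows and 0 <= j2 < cols:
                counts[abs(image[i][j] - image[i2][j2])] += 1
                n_pairs += 1
    if n_pairs == 0:
        raise ValueError("displacement is larger than the image")
    return [count / n_pairs for count in counts]


# A one-pixel checkerboard: neighbors along one axis always differ,
# diagonal neighbors never do.
board = [[(i + j) % 2 for j in range(4)] for i in range(4)]
print(glds_vector(board, 1, 0, levels=2))
print(glds_vector(board, 1, 1, levels=2))
```

The checkerboard shows the effect described in the text at its extreme: the same image gives a vector concentrated at 0 or away from 0 depending on how the displacement compares with the texture element.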

Since the image matrix is discrete, the displacement vector
used in feature calculation is usually chosen to have a
phase of 0°, 45°, 90°, or 135°. These phases correspond
to displacement vectors of d = (d_0, 0), d = (d_0, d_0),
d = (0, d_0), and d = (d_0, -d_0), respectively. If image texture
is directional, features computed at the same vector
magnitude but different phases will convey useful and distinct
information. In our case, we did not observe any
directional preference in the texture-images that we calculated.
Therefore, we averaged textures obtained at the
same vector magnitude but different phases. The vector
magnitudes at displacement vectors of d = (d_0, 0),
d = (0, d_0) and d = (d_0, d_0), d = (d_0, -d_0) differ by a
factor of √2. For this reason, we averaged the texture features
obtained at phases of 0°, 90° and at 45°, 135° separately.
To reduce the number of texture combinations, we used
only the averages obtained at phases of 45° and 135°, i.e., we
averaged the texture features obtained at d = (d_0, d_0) and
d = (d_0, -d_0). In the following discussion, we refer to this
average as the feature obtained at a texture distance of
d_0. The effect of different texture distances on classification
was evaluated by studying the classification accuracy
at texture distances of d_0 = 2, 4, and 8.

In  this  paper,  we  used  four  GLDS  texture  features,  namely, 
contrast,  angular  second  moment,  entropy  and  mean, 
which  are  defined  in  the  Appendix. 
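Since the Appendix is not reproduced in this section, the sketch below uses the commonly cited textbook forms of these four measures for a difference histogram p(k). These are assumptions: the Appendix may use a different logarithm base or normalization, so the code should be read as an illustration of the kind of quantity involved rather than as the paper's exact definitions.

```python
import math


def glds_features(p):
    """Common textbook GLDS measures for a difference histogram p(k).

    p: list of probabilities indexed by gray-level difference k.
    The natural logarithm is used for entropy (an assumption).
    """
    return {
        "contrast": sum(k * k * v for k, v in enumerate(p)),
        "angular_second_moment": sum(v * v for v in p),
        "entropy": -sum(v * math.log(v) for v in p if v > 0),
        "mean": sum(k * v for k, v in enumerate(p)),
    }


# All differences equal to 0 (a perfectly smooth region at this distance).
print(glds_features([1.0, 0.0, 0.0]))
# Differences split evenly between 0 and 1.
print(glds_features([0.5, 0.5]))
```

A histogram concentrated at k = 0 gives low contrast, low mean, low entropy and a high angular second moment, which matches the coarse-texture case described in Section D.1.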

D.2  GLDS  Texture-Images 

Within  a  selected  ROI,  there  might  be  several  sub-regions 
showing  different  texture  statistics,  for  example,  the  region 
inside  the  mass,  the  transition  region  between  the  mass  and 
the  surrounding  tissue,  and  the  surrounding  tissue.  If  the 
texture  is  computed  for  the  entire  ROI,  the  computation 
result  will  be  an  average  of  the  texture  features  for  the 
different  regions. 

One can characterize these feature differences by computing
the features in different sub-regions inside the ROI. In
this study, we moved the center of the sub-region on a rectangular
grid over the ROI, and considered each computed
feature as the pixel value of a texture-image at that grid
location. The texture-images were then input into a CNN
for classification.

Each of the four GLDS features described in the Appendix,
namely, contrast, angular second moment, entropy and
mean, was used to obtain GLDS texture-images. To obtain
a single pixel of a texture-image, one of these features
was computed in an R x R sub-region of the ROI. To obtain
pixel values of the texture-image at different pixel locations,
the center of the sub-region was moved over the ROI
on a rectangular grid with grid distance G. More precisely,
the (i, j)-th element of the texture-image was obtained from
the R x R sub-region whose upper-left corner was at pixel
location (Gi, Gj) in the original image. The computation
of a texture-image is illustrated in Fig. 6.

The sub-regions might or might not overlap depending on
the relation between G and R. The size of the texture-image
was the smallest integer larger than (M - R)/G,
where M was the original ROI size. The classification results
with GLDS texture-images reported in Section IV.B.
were obtained with R = 30 and G = 15.
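The sliding-window construction and the size rule can be checked with a short sketch. The function names are hypothetical, the first index is taken as the row index (an assumption), and `feature_fn` stands in for any of the texture measures; here a simple mean is used so the result is easy to verify.

```python
def texture_image_size(M, R, G):
    """Smallest integer larger than (M - R) / G, for positive integers."""
    return (M - R) // G + 1


def texture_image(roi, feature_fn, R, G):
    """Evaluate feature_fn on R x R sub-regions placed on a grid of step G.

    Element (i, j) comes from the sub-region whose upper-left corner is at
    (G*i, G*j) in the ROI.
    """
    size = texture_image_size(len(roi), R, G)
    result = []
    for i in range(size):
        row = []
        for j in range(size):
            sub = [r[G * j:G * j + R] for r in roi[G * i:G * i + R]]
            row.append(feature_fn(sub))
        result.append(row)
    return result


def region_mean(sub):
    return sum(map(sum, sub)) / (len(sub) * len(sub[0]))


# With M = 256, R = 30 and G = 15 the texture-image is 16 x 16.
print(texture_image_size(256, 30, 15))

# Small worked case: 6 x 6 ROI, 2 x 2 sub-regions, grid step 2.
roi = [[6 * i + j for j in range(6)] for i in range(6)]
print(texture_image(roi, region_mean, R=2, G=2))
```

With R = 30 and G = 15 adjacent sub-regions overlap by half, and the last sub-region ends at pixel 255, so every window lies inside the 256 x 256 ROI.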

D.3  CNN  with  GLDS  Texture-Images 

The CNN architecture employed for classifying mass and
nonmass ROIs using GLDS texture-images is shown in Fig.
7. This CNN had a single hidden layer with three image
groups, a single output, and two input images. The first
input image was a 16 x 16 averaged-subsampled image that
was also used alone for ROI classification with
averaged-subsampled images. The second input image was a 16 x 16
texture-image obtained using one of four GLDS features,
contrast, angular second moment, entropy, and mean.

As in the case of CNN with averaged-subsampled images,
eight pairs of rotated and mirrored images belonging to
each ROI were applied to the CNN consecutively. Since
the GLDS features were calculated as the average of texture
features at phases of 45° and 135°, we did not have to
recalculate GLDS texture-images for the rotated or mirrored
images. We only needed to rotate or mirror the GLDS
texture-images, similar to the rotation and mirroring of
the averaged-subsampled images. As in the case of CNN
with averaged-subsampled images, forward and backpropagation
were accomplished using Equations (3)-(8), this
time with N(1) = 2, N(2) = 3, and N(3) = 1. Classification
accuracy was evaluated using the same methods as in
Section III.C.

E.  Classification  with  SGLD  Texture-Images 
E.1 SGLD Features

A second method of defining statistical texture features is
through the SGLD matrix. SGLD features were previously
shown to be useful in distinguishing mass ROIs from normal
tissue [11], [12]. To compute the SGLD matrix for an
image H(i, j), a displacement vector d = (d_1, d_2) is defined.
The (k_1, k_2)-th element of the SGLD matrix is defined
as the joint probability that gray levels k_1 and k_2 occur at
a distance of (d_1, d_2) in H(i, j).

SGLD features are affected by the number of bits used to
represent the image (bit depth). Images of lower bit depth
can be derived from images of higher bit depth by eliminating
the least significant bits. The choice of bit depth used
in SGLD matrix computation is important because of the
trade-off between the gray level resolution and the statistics
of the estimated joint probability distribution. If the
bit depth is high, then the number of pixel pairs that contribute
to an element of the SGLD matrix will be low, and
the statistics of the estimated joint probability distribution
will be poor. The noise in the least significant bits of the
image will also affect the distribution. On the other hand,
if the bit depth is low, these two problems are alleviated,
but some of the characteristic features of the distribution
may be lost due to the reduced gray level resolution. Based
on the results of [11], we used a bit depth of 7 bits in SGLD
matrix construction.

SGLD features mainly reflect the distribution of the elements
in the SGLD matrix. For example, the correlation
measure defined in [20] is high when the entries are higher
along the main diagonal of the SGLD matrix, and the entropy
measure attains its maximum value when all the elements
of the SGLD matrix are equal.

As in the case of GLDS features, we used displacement
vectors with phases of θ = 45° and 135° for SGLD texture
feature calculation. These phases corresponded to displacement
vectors of d = (d0, d0) and d = (d0, -d0). Texture
features obtained for these two displacement vectors
were averaged to obtain an SGLD feature
at a texture distance of d0. The effect of different
texture distances on classification was evaluated by
studying the classification accuracy at texture distances of
d0 = 12, 16, 20, and 24.

In this paper, we used three SGLD features, namely correlation,
entropy, and difference entropy, which were among
the best features for the classification of masses and benign
tissue in a previous study [11]. The definitions of these features
are given in the Appendix.
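As a hedged illustration of the three features, the sketch below uses the forms commonly given in the texture literature; the base-2 logarithm and the function name `sgld_features` are assumptions, and the authors' exact normalization may differ.

```python
import numpy as np

def sgld_features(p):
    """Return (correlation, entropy, difference entropy) of an SGLD matrix p.

    p must be a square matrix of non-negative entries that sum to 1.
    """
    p = np.asarray(p, dtype=np.float64)
    k = p.shape[0]
    idx = np.arange(k)
    px, py = p.sum(axis=1), p.sum(axis=0)          # marginal distributions
    mx, my = (idx * px).sum(), (idx * py).sum()    # marginal means
    sx = np.sqrt((((idx - mx) ** 2) * px).sum())   # marginal standard deviations
    sy = np.sqrt((((idx - my) ** 2) * py).sum())
    correlation = ((np.outer(idx, idx) * p).sum() - mx * my) / (sx * sy)
    nz = p[p > 0]                                  # skip zeros: 0 * log 0 is taken as 0
    entropy = -(nz * np.log2(nz)).sum()
    # Distribution of the absolute gray level difference |i - j|.
    diff = np.abs(idx[:, None] - idx[None, :])
    p_diff = np.bincount(diff.ravel(), weights=p.ravel(), minlength=k)
    nzd = p_diff[p_diff > 0]
    difference_entropy = -(nzd * np.log2(nzd)).sum()
    return correlation, entropy, difference_entropy

uniform = np.full((2, 2), 0.25)
diagonal = np.diag([0.5, 0.5])
cor_u, ent_u, dent_u = sgld_features(uniform)
cor_d, ent_d, dent_d = sgld_features(diagonal)
```

The two toy matrices mirror the behavior described earlier: the uniform matrix gives the larger entropy and zero correlation, while the purely diagonal matrix gives a correlation of 1. A feature at texture distance d0 would then be the average of the values computed for the two displacement vectors. A matrix with a constant marginal makes the correlation undefined and needs separate handling.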

E.2  SGLD  Texture-Images 

The computation of SGLD texture-images parallels the
computation of GLDS texture-images. Each of the three
SGLD features described in the Appendix, namely correlation,
entropy, and difference entropy, was used to obtain
SGLD texture-images. To obtain a single pixel of a
texture-image, one of these features was computed in an
R × R sub-region of the ROI. To obtain pixel values of the
texture-image at different pixel locations, the center of the
sub-region was moved over the ROI on a rectangular grid
with grid distance G. Fig. 6 describes the computation of
texture-images pictorially. The classification results with
SGLD texture-images reported in Section IV.C were obtained
with R = 60 and G = 13 for texture distances (d0)
of 12 and 16, and with R = 75 and G = 12 for texture
distances of 20 and 24.
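The sliding sub-region computation can be sketched as follows. The 256 × 256 ROI size is an assumption (it is not stated in this section), and `texture_image` and `feature_fn` are hypothetical names; any function that maps a sub-region to a single number can be passed in.

```python
import numpy as np

def texture_image(roi, feature_fn, r, g):
    """Compute a texture-image from r x r sub-regions placed on a grid of spacing g."""
    roi = np.asarray(roi)
    n_rows = (roi.shape[0] - r) // g + 1
    n_cols = (roi.shape[1] - r) // g + 1
    out = np.empty((n_rows, n_cols), dtype=np.float64)
    for a in range(n_rows):
        for b in range(n_cols):
            # One output pixel per sub-region; the sub-region shifts by g each step.
            sub = roi[a * g:a * g + r, b * g:b * g + r]
            out[a, b] = feature_fn(sub)
    return out

roi = np.ones((256, 256))                      # assumed ROI size
tex_a = texture_image(roi, np.mean, r=60, g=13)
tex_b = texture_image(roi, np.mean, r=75, g=12)
```

Under the assumed ROI size, both parameter pairs give 16 sub-region positions per axis (60 + 15 × 13 = 255 and 75 + 15 × 12 = 255), which matches the 16 × 16 texture-images used as CNN inputs.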

E.3 CNN with SGLD Texture-Images

The CNN architecture employed for classifying mass and
nonmass ROIs using SGLD texture-images is the same as
that used for classification with GLDS texture-images, and
is shown in Fig. 7. This CNN had a single hidden layer with
three image groups, a single output, and two input images.
The first input image was a 16 × 16 averaged-subsampled
image that was also used alone for ROI classification with
averaged-subsampled images. The second input image was
a 16 × 16 texture-image obtained using one of three SGLD
features: correlation, entropy, or difference entropy. CNN
training and performance evaluation were carried out similarly
to Section III.C.

F. Classification with GLDS and SGLD Texture-Images

The CNN architecture employed for classifying mass and
nonmass ROIs using both GLDS and SGLD texture-images
is shown in Fig. 8. We investigated the use of a three-layer
CNN with three input images and a single output node.
The number of hidden layer groups, N(2), and the weight
kernel size between the input layer and the hidden layer
were variable. The first input image was a 16 × 16
averaged-subsampled image, the second input image was
the 16 × 16 GLDS mean texture-image at a texture
distance of d0 = 4, and the third input image was the
16 × 16 SGLD correlation texture-image at a texture distance
of d0 = 16. CNN training and performance evaluation were
carried out similarly to Section III.C.

IV. Results

A.  Results  with  subsampled  images 

For the purpose of computational efficiency, the image size
was reduced by averaging adjacent pixels and subsampling,
as described in Section III.C. The resulting 32 × 32 or 16 × 16
ROIs were used as inputs to a three-layer CNN with one
image group at the input layer, N(2) image groups at the
hidden layer, and a single output node. We investigated the
effect of varying the number of image groups N(2) and the
CNN weight kernel size S_w(1) between the input layer and
the hidden layer. The Az values for the training and test
sets are summarized in Table I for 16 × 16 input images,
and in Table II for 32 × 32 input images. The
training and test ROC curves for the CNN with 16 × 16
input images, N(2) = 3 and S_w(1) = 10 are plotted in Fig. 9,
and the corresponding learning curves are plotted in Fig. 10.
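One way to realize averaging followed by subsampling is to replace each non-overlapping block of pixels by its mean. The sketch below makes that assumption (the exact procedure is given in Section III.C, outside this excerpt) and also assumes a 256 × 256 ROI; `average_subsample` is a hypothetical name.

```python
import numpy as np

def average_subsample(image, factor):
    """Replace each non-overlapping factor x factor block by its mean value."""
    image = np.asarray(image, dtype=np.float64)
    rows, cols = image.shape
    if rows % factor or cols % factor:
        raise ValueError("image dimensions must be multiples of factor")
    # Split each axis into (block index, offset within block) and average the offsets.
    blocks = image.reshape(rows // factor, factor, cols // factor, factor)
    return blocks.mean(axis=(1, 3))

small = average_subsample(np.arange(16).reshape(4, 4), 2)
roi16 = average_subsample(np.zeros((256, 256)), 16)   # assumed ROI size
roi32 = average_subsample(np.zeros((256, 256)), 8)
```

Under the assumed ROI size, a factor of 16 yields the 16 × 16 input and a factor of 8 yields the 32 × 32 input.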

B.  Results  with  GLDS  features 

The results of Subsection IV.A indicate that the performance
was not significantly different (i) between 16 × 16
and 32 × 32 input images; and (ii) among CNN architectures
with different values of N(2) and S_w(1). For this
reason, in this subsection, we chose to fix these variables
while we studied the effect of the texture feature and distance
variables.

All the CNNs in this subsection had two 16 × 16 input
images, N(2) = 3, and S_w(1) = 10. The first input image
was a 16 × 16 averaged-subsampled image that was also
used in the previous subsection. The second input image
was a 16 × 16 texture-image obtained using one of four
GLDS features: contrast, angular second moment, entropy,
or mean. Training and test results are summarized in
Table III for texture distances of d0 = 2, 4, and 8.

C.  Results  with  SGLD  features 

As in Subsection IV.B, the CNNs in this subsection had
two 16 × 16 input images, N(2) = 3, and S_w(1) = 10.
The first input image was an averaged-subsampled image,
and the second input image was a texture-image obtained
using one of three SGLD features: correlation, entropy, or
difference entropy. We used texture distances of d0 = 12,
16, 20, and 24, because the study in [11] indicated that the
best classification accuracy was obtained within this range.
Training and test results are summarized in Table IV.

D.  Results  with  GLDS  and  SGLD  features 

In Subsections IV.B and IV.C, the CNN architecture was
kept fixed as we studied the effect of the texture feature
and distance variables for GLDS and SGLD features. In
this subsection, we chose one GLDS and one SGLD feature,
and studied the effect of the CNN architecture as we
did in Section IV.A, but in this case with three input images
instead of a single input image. The first input image
was a 16 × 16 averaged-subsampled image that was also
used in the previous three subsections. The second image
was a GLDS mean texture-image at d0 = 4, and the third
was an SGLD correlation texture-image at d0 = 16. These
texture-images were chosen because they seemed to yield
better classification results than the other texture-images,
as shown in Tables III and IV. Examples of these three
CNN input images, for a mass and three nonmass ROIs
extracted from the same mammogram, are shown in Fig.
11, along with the background-corrected ROIs. We investigated
the effect of varying N(2) and S_w(1). The Az values
for the training and test sets are summarized in Table V.
The training and test ROC curves for the CNN architecture
with N(2) = 8 and S_w(1) = 10 are plotted in Fig. 12, and
the learning curves are plotted in Fig. 13.

V.  Discussion 

A comparison of Tables I and V reveals that texture-images
significantly improve the classification performance. Considering
rows with the same number of hidden-layer image
groups and the same kernel size in Tables I and V, test
Az values in Table V are 0.04 to 0.06 higher than their
counterparts in Table I. The best test Az value in Table V
reaches 0.87, which, as observed from Fig. 12, corresponds
to a TPF of 90% at an FPF of 31%. Figs. 10 and 13 indicate
that as training continued beyond a certain epoch, test Az
fluctuated around a saturation value, while training Az continued
to increase. As the number of CNN input images
was increased, we observed a decline in the CNN learning
rate, i.e., more training epochs were required for the
test Az curve (bold lines in Fig. 10 and Fig. 13) to reach
its maximum.
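For readers unfamiliar with Az and with TPF/FPF operating points, the sketch below computes an empirical (rank-based) area under the ROC curve and the operating point at a given threshold. This is a substitute technique chosen for illustration: the Az values in this paper may come from a fitted ROC curve, which generally gives slightly different numbers. The function names and the toy scores are hypothetical.

```python
import numpy as np

def empirical_az(mass_scores, normal_scores):
    """Fraction of (mass, normal) pairs ranked correctly; ties count one half."""
    m = np.asarray(mass_scores, dtype=np.float64)[:, None]
    n = np.asarray(normal_scores, dtype=np.float64)[None, :]
    return float(((m > n) + 0.5 * (m == n)).mean())

def operating_point(mass_scores, normal_scores, threshold):
    """Return (TPF, FPF) when outputs at or above threshold are called mass."""
    tpf = float(np.mean(np.asarray(mass_scores) >= threshold))
    fpf = float(np.mean(np.asarray(normal_scores) >= threshold))
    return tpf, fpf

# Toy classifier outputs, invented purely for illustration.
mass = [0.9, 0.8, 0.4]
normal = [0.7, 0.3, 0.1]
az = empirical_az(mass, normal)
tpf, fpf = operating_point(mass, normal, threshold=0.5)
```

Sweeping the threshold traces out the ROC curve; a statement such as "TPF of 90% at an FPF of 31%" corresponds to one point on that curve.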

A comparison of different rows in Table I or Table V indicates
that the effect of the CNN architecture on classification
accuracy is less important than that of the use of
texture-images. For example, in Table V, when the kernel
size was fixed at 10, and the number of image groups was
varied, the test Az value did not change from its best value
of 0.87. Test Az values within Table I and Table V differed
by 0.01 to 0.03 when the number of image groups was varied
between 3 and 8, and the kernel size was varied between
8 and 12. When we varied the CNN architecture, we did
not observe a significant change in the number of training
epochs necessary for the test Az curve to reach its maximum.
The overall training time on a computer was longer
when the kernel size and the number of image groups were
large, since each training epoch took a longer time to run.
One has to study all “reasonable” combinations of CNN
architectures and texture feature variables in order to optimize
the classification accuracy. However, since CNN
training is computationally intensive, this would take an
inordinate amount of time. Instead, we attempted to find
the “best” combination of features, texture distance, and
CNN architecture in two stages, within the constraint of
computation time. First, in Sections IV.B and IV.C, we
determined which features and texture distances yielded
better classification results using a single CNN architecture.
Then, in Section IV.D, we varied the CNN architecture
while the features and texture distances were fixed.
Clearly, this results in a “suboptimal” combination, which,
nevertheless, produced satisfactory classification results. It
may be possible to improve our results using CNNs that
employ more than three input images, more than a single
hidden layer, or more than a single output node. It may
also be possible to use different techniques to derive different
CNN input images from an ROI to further improve the
classification accuracy. However, the results of our limited-scale
study demonstrate the viability of our approach.

Since a neural network uses an iterative minimization technique
in training, the initial state of a CNN is potentially
important for training. To obtain an indication about the
dependence of CNN performance on initial weight values,
we initialized the CNN in the last row of Table V (N(2) = 8
and S_w(1) = 10) with five different seeds for the random
number generator, which produced five different sets of initial
weights. After training and testing, we computed the
average and standard deviation of the test Az obtained using
these five sets of initial weights. The average was
0.87, i.e., unchanged from the value in Table V, and the
standard deviation was 0.002. This indicates that the performance
of the CNN that we implemented is consistent in
spite of random variations in the initial weights.

Results of Section IV.A indicate that there is no significant
difference in classification accuracy between CNNs that operate
on 16 × 16 and 32 × 32 subsampled ROIs. However,
this does not mean that resolution of the ROI does not
have any effect on the classification accuracy. It may well
be possible that 32 × 32 subsampled ROIs still do not contain
enough detail to improve the classification results. It
may also be possible to significantly improve the classification
results by applying larger ROIs with better resolution
to the CNN. When the computing power becomes available,
we will explore the effect of the ROI resolution on
classification accuracy.

A shift-invariant neural network (SINN) that is similar to
CNN was applied in [14] to detection of microcalcifications
on mammograms. CNN and SINN differ mainly in that
the test and training outputs of a SINN are images, as opposed
to real numbers. The output images of a SINN are
processed using thresholding and segmentation techniques
before classification is performed. The advantage of SINN
is that it yields a spatially-invariant output, i.e., ignoring
edge effects, if the input ROI is translated, the output is
also translated by the same amount. The advantages of
CNN in mass detection include (i) in training, one does
not need to supply a desired image to the CNN that contains
the pixel locations of the true mass, which may be
difficult to obtain in many cases; and (ii) after testing, no
image processing is required to perform classification: only
a single threshold is used to separate the two classes. In
our current application, spatial invariance is not very critical
since the masses are centered in the manually-extracted
ROIs. We are currently developing algorithms which will
center the mass in an ROI obtained by an automated detection
and extraction program.

In our laboratory, we have previously investigated two
other classification techniques using the same ROI set
as in this paper [11], [12]. The classification method in
[11] employs SGLD texture features obtained at fixed distances,
and the method in [12] employs multiresolution
texture analysis. The best test results obtained in this
paper (Az = 0.873) are better than the best results in [11]
(Az = 0.823), and comparable to the best results in [12]
(Az = 0.859).

Our interest in CNN as an alternative classifier stems
from the fact that different classifiers are potentially better
suited to classify different types of masses and normal
tissue. It is not yet possible to predict which masses will
be more correctly classified by a CNN classifier and which
masses will be more correctly classified by multiresolution
texture analysis. However, our experiments have shown
that combining the outputs of these two classifiers improves
classification accuracy. In [26], combining the results of a
CNN that operates on averaged-subsampled images (as in
Section IV.A), and a classifier that operates on multiresolution
texture features, we obtained an Az value of 0.89
with the same ROI set as in this paper. A more complete
analysis of the application of different classifiers to
the problem of ROI classification will be published elsewhere.

The long-term objective of this research is to develop a
CAD system which will provide a second opinion to the
radiologist concerning the presence of lesions on a mammogram.
This long-term objective can be divided into two
more easily manageable goals: (i) detection of suspicious
ROIs on a mammogram, and (ii) elimination of suspicious,
but normal ROIs from the ROIs detected in (i). This paper
deals with this latter goal. When the research on these two
goals is integrated, it will be possible to conduct observer
studies to evaluate the improvement in radiologists' performance
when they are assisted by CAD. Although we have
not attempted to compare the ROI classification accuracy
reported in this paper to that of the radiologists, we suspect
that radiologists will perform significantly better than
a CNN in classifying the ROIs in this paper. However, the
contribution of the CAD system will not be whether the
CAD outperforms the radiologists, but rather how much
CAD assists radiologists in detecting lesions that would
otherwise be missed.

VI.  Conclusion 

We studied the application of a convolution neural network
to classification of masses and normal ROIs. CNN input
images were derived from the ROIs using (i) averaging and
subsampling; (ii) GLDS feature extraction; and (iii) SGLD
feature extraction. Using a three-layer CNN and three input
images derived from each ROI, we obtained an average
test Az of 0.87, which corresponded to an average true-positive
fraction of 90% at a false-positive fraction of 31%.
Our results indicated that the choice of CNN input images
is more important than the choice of CNN architecture. Although
classification performance needs to be further improved
in order for the classifier to be useful in a clinical
setting, our study indicates that a CNN can be trained
to effectively classify masses and normal breast tissue on
mammograms. We are currently investigating the effectiveness
of the CNN classifier for differentiation of masses and
normal ROIs obtained with an automatic extraction algorithm
as a step towards a fully automated computer-aided
diagnosis scheme.

Appendix

A. GLDS Features

Given a GLDS vector p_d(k) described in Section III.D, the
GLDS texture features used in this paper are defined as
follows [19], where K is the dimension of p_d(k).

B. SGLD Features

Given an SGLD matrix described in Section III.E, the SGLD
texture features used in this paper are defined as follows [20],
where K is the size of the SGLD matrix.

1. Entropy:
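The displayed definitions did not survive in this copy. As a hedged point of reference only, the forms commonly given in the texture literature for the three SGLD features used in this paper are sketched below. The base-2 logarithm, the index range 0 to K-1, and the symbols i, j for gray levels are assumptions; the definitions in [20] may differ in normalization.

```latex
\begin{align*}
\text{Entropy} &= -\sum_{i=0}^{K-1}\sum_{j=0}^{K-1} p_d(i,j)\,\log_2 p_d(i,j), \\
\text{Correlation} &= \frac{\sum_{i=0}^{K-1}\sum_{j=0}^{K-1} i\,j\,p_d(i,j) - \mu_x\,\mu_y}{\sigma_x\,\sigma_y}, \\
\text{Difference entropy} &= -\sum_{k=0}^{K-1} p_{x-y}(k)\,\log_2 p_{x-y}(k),
\qquad p_{x-y}(k) = \sum_{|i-j|=k} p_d(i,j).
\end{align*}
```

Here \mu_x, \mu_y and \sigma_x, \sigma_y denote the means and standard deviations of the marginal distributions p_x(i) = \sum_j p_d(i,j) and p_y(j) = \sum_i p_d(i,j), and terms with zero probability are taken to contribute zero to the entropies.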

TABLE I

CNN CLASSIFICATION PERFORMANCE WITH 16 × 16 SUBSAMPLED IMAGES.

Kernel size   Number of groups   Training Az   Test Az
     6               4              0.82         0.80
     8               4              0.82         0.81
    10               4              0.85         0.81
    12               4              0.87         0.82
    14               4              0.87         0.81
    10               3              0.87         0.83
    10               6              0.85         0.82
    10               8              0.83         0.81

TABLE II

CNN CLASSIFICATION PERFORMANCE WITH 32 × 32 SUBSAMPLED IMAGES.

Kernel size   Number of groups   Training Az   Test Az
    11               4              0.81         0.80
    16               4              0.84         0.80
    20               4              0.84         0.83
    23               4              0.84         0.82
    20               3              0.84         0.82
    20               6              0.84         0.82
    20               8              0.84         0.82

TABLE III

CNN CLASSIFICATION PERFORMANCE WITH TWO INPUT IMAGES DERIVED FROM AN ROI. THE FIRST IMAGE IS THE AVERAGED AND SUBSAMPLED
IMAGE, THE SECOND IMAGE IS THE GLDS TEXTURE-IMAGE. ASM, CON, G_ENT, AND MEAN STAND FOR ANGULAR SECOND MOMENT, CONTRAST,
ENTROPY, AND MEAN, RESPECTIVELY.

                 d0 = 2                   d0 = 4                   d0 = 8
Feature   Training Az   Test Az    Training Az   Test Az    Training Az   Test Az
ASM           -            -           -            -           -            -
G_ENT         -          0.84        0.91         0.85        0.89         0.84
MEAN        0.90         0.84        0.90         0.86        0.88         0.85

TABLE IV

CNN CLASSIFICATION PERFORMANCE WITH TWO INPUT IMAGES DERIVED FROM AN ROI. THE FIRST IMAGE IS THE AVERAGED AND SUBSAMPLED
IMAGE, THE SECOND IMAGE IS THE SGLD TEXTURE-IMAGE. COR, DIF_ENT, AND S_ENT STAND FOR CORRELATION, DIFFERENCE ENTROPY, AND
ENTROPY, RESPECTIVELY.

TABLE V

CNN CLASSIFICATION PERFORMANCE WITH THREE INPUT IMAGES DERIVED FROM AN ROI. THE FIRST IMAGE IS THE AVERAGED AND SUBSAMPLED
IMAGE, THE SECOND IMAGE IS THE GLDS MEAN TEXTURE-IMAGE AT d0 = 4, THE THIRD IMAGE IS THE SGLD CORRELATION TEXTURE-IMAGE AT
d0 = 16.

Kernel size   Number of groups   Training Az

0.90

Figure Captions

Fig. 1. Basic CNN architecture.

Fig. 2. An example of the mass and normal ROIs selected from one of
the mammograms used in this study. The four ROIs are: upper left,
mass; upper right, mixed dense/fatty tissue; lower left, dense tissue; lower
right, fatty tissue.

Fig. 3. The averaging boxes and distances used in background correction.

Fig. 4. An example of background correction. (a) Original ROI that contains
a malignant mass. (b) Background corrected ROI.

Fig. 5. The CNN architecture used for ROI classification with
averaged-subsampled ROIs.

Fig. 6. Computation of texture-images.

Fig. 7. The CNN architecture used for ROI classification with
averaged-subsampled ROIs plus a texture-image.

Fig. 8. The CNN architecture used for ROI classification with
averaged-subsampled ROIs plus the GLDS mean texture-image and the SGLD
correlation texture-image.

Fig. 9. ROC curve for CNN with the 16 × 16 averaged-subsampled input image,
N(2) = 3, and S_w(1) = 10. The Az value was 0.87 for training and 0.83
for test.

Fig. 10. Training and test Az values versus training epoch number for the
CNN in Fig. 9.

Fig. 11. Background corrected image, subsampled image, GLDS mean
texture-image at d0 = 4, and SGLD correlation texture-image at d0 = 16 for (a) a
mass ROI, as shown in Fig. 4b, and (b-d) three nonmass ROIs extracted
from the same mammogram.

Fig. 12. ROC curve for CNN with three input images, N(2) = 8 and S_w(1) =
10. The first input image is the 16 × 16 averaged-subsampled image, the
second image is the GLDS mean texture-image at d0 = 4, and the third
image is the SGLD correlation texture-image at d0 = 16. The Az value
was 0.91 for training and 0.87 for test.

Fig. 13. Training and test Az values versus training epoch number for the
CNN in Fig. 12.

Classification of mass and normal breast tissue on digital mammograms:
Multiresolution texture analysis

Datong Wei, Heang-Ping Chan, Mark A. Helvie, Berkman Sahiner, Nicholas Petrick,
Dorit D. Adler, and Mitchell M. Goodsitt

Department  of  Radiology,  University  of  Michigan,  Ann  Arbor,  Michigan 
(Received  16  August  1994;  accepted  for  publication  31  May  1995) 

We investigated the feasibility of using multiresolution texture analysis for differentiation of masses
from normal breast tissue on mammograms. The wavelet transform was used to decompose regions
of interest (ROIs) on digitized mammograms into several scales. Multiresolution texture features
were calculated from the spatial gray level dependence matrices of (1) the original images at
variable distances between the pixel pairs, (2) the wavelet coefficients at different scales, and (3)
the wavelet coefficients up to a certain scale and then at variable distances between the pixel pairs. In
this study, 168 ROIs containing biopsy-proven masses and 504 ROIs containing normal parenchyma
were used as the data set. The mass ROIs were randomly and equally divided into training
and test groups along with corresponding normal ROIs from the same film. Stepwise linear discriminant
analysis was used to select optimal features from the multiresolution texture feature space
to maximize the separation of mass and normal tissue for all ROIs. We found that texture features
at large pixel distances are important for the classification task. The wavelet transform can effectively
condense the image information into its coefficients. With texture features based on the
wavelet coefficients and variable distances, the area Az under the receiver operating characteristic
curve reached 0.89 and 0.86 for the training and test groups, respectively. The results demonstrate
that a linear discriminant classifier using the multiresolution texture features can effectively classify
masses from normal tissue on mammograms.

Key  words:  mammography,  computer-aided  diagnosis,  mass,  wavelet  transform,  multiresolution 
texture  analysis,  linear  discriminant  classifier 

I. INTRODUCTION

Mammography is considered the most reliable method for
the early detection of breast cancers. However, it has been
reported that radiologists do not detect all breast cancers that
are visible on mammograms in retrospective studies. Previous
studies indicate that computer-aided diagnosis (CAD)
can provide a second opinion to the radiologists and potentially
decrease the missed detection rate. Computerized
classification of the malignant or benign features of an abnormality
may also be expected to reduce the number of
negative biopsies. Improvement in the accuracy of mammography
will increase its efficacy for screening and diagnosis of
breast cancer.

Computer vision and artificial intelligence techniques
have been developed to detect or characterize abnormalities
on digital mammograms. Image processing is usually a first
step in computer vision to enhance the signal-to-noise characteristics
of the objects being detected. Features are then
extracted for classification between the signal and the background.
Microcalcifications are ideal targets for computer detection
due to their clinical relevance, their potential subtlety,
and the lack of coexisting normal structures that have the
same appearance. The detection and classification of microcalcifications
have received a lot of attention and demonstrated
significant progress. Breast masses are more difficult
to detect and classify than microcalcifications because
masses can be simulated or obscured by normal breast
parenchyma. Fourier analysis of the masses does not
show consistent and distinctive high-frequency components.
Most of the signal (mass) energy is in the low-frequency
region and overlaps with the frequency components of the
normal tissue. The gray level changes at the mass boundary
are usually gradual and not as abrupt as those at the boundary
of microcalcifications. Moreover, the shape, size, and the
gray level profile of the masses vary from case to case. These
cause difficulties in the application of conventional image
processing methods to the detection and feature characterization
of masses.

Morphological features have been used to decrease the
number of false-positive detections. Spiculated masses
were the focus of detection in the analysis of edge orientation
in Kegelmeyer's work. Breast cancers can also manifest
as circumscribed masses. Selective median filtering
and template matching techniques were proposed to detect
suspicious circumscribed masses. For both types of masses,
texture features were extracted from regions of interest
(ROIs) in digital mammograms and were used in a decision
tree to classify the masses from normal tissue with some
success.

The discovery of cortical neurons which respond specifically
to stimuli within certain orientations and spatial frequencies
suggests that multiorientation and multiresolution
are part of the biological mechanism of the human visual
system. Interest in multiresolution image analysis has
been growing rapidly in the field of computer vision. A multiresolution
representation provides a simple hierarchical
framework for analyzing image information. The compression
of images by wavelet transforms can achieve a high
compression ratio without significant loss of image details,
indicating that important image features are condensed in the
wavelet coefficients. Texture analysis in the wavelet transform
domain was used to distinguish different texture patterns
(e.g., French canvas, beach sand, and oriental straw
cloth) with some success. Wavelet transform has been applied
to mammographic image processing, especially to the
enhancement and detection of microcalcifications. Laine
proposed adaptive multiscale processing with
wavelet decomposition and reconstruction for feature analysis
and contrast enhancement. Richardson discussed the use
of wavelet packets that can be superior to wavelets for certain
classes of mammographic signals. Qian et al. proposed
a tree-structured nonlinear adaptive filter and the wavelet
transform for the detection and segmentation of microcalcifications
on mammograms.

In this paper, we discuss the application of multiresolution
texture analysis to digitized mammograms to distinguish
mass from normal tissue. Multiresolution texture features
were extracted from the spatial gray level dependence
(SGLD) matrices (1) of the original image at variable distances,
(2) of the wavelet coefficients at different scales, and
(3) of the wavelet coefficients up to certain scales and then at
variable distances, forming three feature vectors for each
ROI. We used stepwise linear discriminant analysis to select
features from each of these three texture spaces to maximize
the separation of masses and normal tissue. The ability of the
three feature vectors for classifying mammographic masses
and normal tissue was compared. Receiver operating characteristic
analysis was used to evaluate the classification accuracy
of the texture features from the different feature spaces.

II.  METHODS 

A.  Database  selection 

The mammograms used in this study were randomly selected from the files
of patients who had undergone biopsies in the Department of Radiology
at the University of Michigan. The mammograms were acquired with
dedicated mammographic systems with a 0.3-mm focal spot, a molybdenum
anode, a 0.03-mm-thick molybdenum filter, and a 5:1 reciprocating grid
or a stationary grid. The image receptor was a Kodak MinR/MRE
screen/film system with extended cycle processing. Our selection
criterion was that a biopsy-proven mass could be seen on the mammogram.
Initially, more than 300 mammograms were acquired. To avoid the effect
of the repetitive grid pattern on the texture feature calculation and
the classification, all mammograms with grid lines were excluded. Our
final data set was composed of 168 mammograms.

The mammograms were digitized with a laser film scanner (LUMISYS
DIS-1000) at a pixel size of 0.1 mm × 0.1 mm and 4096 gray levels. The
light transmitted through the mammographic films was amplified
logarithmically before digitization. After calibration, the pixel
values were linearly proportional to the optical density in the range
of 0.1-2.8 optical density units. The slope of the calibration curve
decreased gradually outside this range.

Before an automated computer segmentation procedure was developed, we
used manual ROI extraction to study the feasibility of using texture
features for the classification of mass and normal tissue in all types
of breast parenchyma. Four different ROIs, each with 256 × 256 pixels,
were selected manually from each mammogram. One ROI contained a true
mass, which was identified by an experienced mammographer. A second
contained normal parenchyma including the densest tissue on that
mammogram, a third contained mixed dense/fatty tissue, and a fourth
contained fatty tissue. Figure 1 shows the 672 ROIs from the 168
mammograms in reduced spatial resolution. The 168 case samples in the
data set contained a mixture of benign (n = 83) and malignant (n = 85)
masses. Forty-five of the malignant masses and six of the benign masses
were spiculated. The visibility of the masses was ranked by experienced
radiologists on a scale of 1-10 (1 = most obvious, 10 = most subtle),
which corresponded to the range of masses seen on clinical mammograms.
The length of the long axis (size) of the masses was also measured by
the radiologists. The distributions of the visibility scores and the
sizes are shown in Fig. 2. It can be seen from Figs. 1 and 2 that the
masses with different shapes and visibility found in clinical practice
were fairly well represented in the data set.

B.  Texture  features 

The input images were digitized to 12 bits of resolution. The average
gray level of each ROI was removed and replaced by a constant for all
the ROIs before the texture analysis and wavelet transform were
performed, in order to reduce the variability of the texture features
caused by exposure conditions. The texture features were calculated
based on the SGLD matrix, also known as the concurrence or
co-occurrence matrix. The (i,j)-th element of the SGLD matrix,
p_{d,θ}(i,j), is the joint probability that the gray levels i and j
occur in a direction θ at a distance of d pixels apart (d is the
distance in terms of number of pixels and is referred to as pixel
distance in the following discussion) over the entire ROI. The joint
probability describes the frequency with which a pair of gray level
values occurs between pixel pairs with a defined relative spatial
relationship. The SGLD matrix is a two-dimensional histogram. The
matrix size depends on the gray level resolution (i.e., the bit depth)
of the digitized image and the bin width used in determining the
histogram. If the gray level resolution is n bits and the bin width is
b gray levels, then the size of the SGLD matrix will be a × a, where
a = 2^n/b. For example, for a 12-bit image, the matrix size of an SGLD
matrix constructed with a bin width of 1 gray level is 4096 × 4096. The
matrix size is reduced to 256 × 256 if a bin width of 16 gray levels is
used. The increased bin width is equivalent to reducing the gray level
resolution of the 12-bit image to 8 bits by eliminating the 4 least
significant bits and using a bin width of 1 gray level in determining
the SGLD matrix. Based on the findings of our previous study, 8-bit
gray level resolution provided the best classification accuracy when
texture features calculated at a fixed pixel distance d were used.
Therefore, 8-bit gray level resolution was chosen for the formulation
of the SGLD matrices in this study.
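
The construction of an SGLD matrix with a reduced gray level resolution can be illustrated with a short Python sketch. This is not the code used in the study; the function name `sgld_matrix`, its parameters, and the choice to count each pixel pair in both orders (which makes the matrix symmetric) are assumptions made for illustration only.

```python
def sgld_matrix(image, dy, dx, in_bits=12, out_bits=8):
    """Joint probability p(i, j) of gray levels i and j at offset (dy, dx).

    `image` is a list of rows of non-negative integers below 2**in_bits.
    Dropping the (in_bits - out_bits) least significant bits is the same
    as binning with a width of 2**(in_bits - out_bits) gray levels.
    """
    shift = in_bits - out_bits
    size = 2 ** out_bits                      # a = 2^n / b
    counts = [[0] * size for _ in range(size)]
    rows, cols = len(image), len(image[0])
    total = 0
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i = image[r][c] >> shift
                j = image[r2][c2] >> shift
                # Assumption: count the pair in both orders (symmetric matrix).
                counts[i][j] += 1
                counts[j][i] += 1
                total += 2
    return [[n / total for n in row] for row in counts]


# Toy 4-bit image reduced to 2 bits (bin width of 4 gray levels).
toy = [[0, 0, 4, 4],
       [0, 0, 4, 4],
       [0, 8, 8, 8],
       [8, 8, 12, 12]]
p = sgld_matrix(toy, dy=0, dx=1, in_bits=4, out_bits=2)
```

With the default arguments (12-bit input, 8-bit output) the returned matrix has 256 × 256 elements, which matches the matrix size chosen in the text.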

Eight  texture  features  were  examined:  correlation,  energy, 
entropy,  inertia,  inverse  difference  moment,  sum  average, 
sum  entropy,  and  difference  entropy.  Some  of  the  texture 
features  can  be  used  to  describe  some  visual  properties  of  the 
images  while  others  may  be  more  abstract.  For  example,  cor- 
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(a) 


Fig. 1. The 168 case samples used in this study with ROIs containing (a) biopsy-proven masses, (b) dense breast tissue, (c) mixed dense/fatty breast tissue,
and (d) fatty breast tissue. The upper halves are the G1 cases and the lower halves the G2 cases.
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Fig.  2.  The  di.smbuiion  or  la)  the  visibility  score  and  lb)  the  size  of  the  168 
masses. 


relation is a measure of gray level dependency. Energy (or angular
second moment) and entropy are measures of pixel homogeneity. Inertia
(or contrast) represents the amount of intensity variation. It is
difficult, however, to relate specific image characteristics to each of
these features. The mathematical definitions of the features can be
found in the literature and are given in Appendix B.
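
As an illustration, the eight features can be computed from a normalized SGLD matrix as in the sketch below. The formulas are the commonly used Haralick-style definitions and are offered here as an assumption; Appendix B remains the authoritative source for the definitions used in the study. The base-2 logarithm and the zero-based gray level indices are also assumptions, and the function name `texture_features` is hypothetical.

```python
import math


def texture_features(p):
    """Eight Haralick-style features from a normalized SGLD matrix `p`."""
    n = len(p)

    def h(values):
        # Shannon entropy in bits; zero-probability terms contribute nothing.
        return -sum(v * math.log2(v) for v in values if v > 0)

    px = [sum(row) for row in p]                              # row marginal
    py = [sum(p[i][j] for i in range(n)) for j in range(n)]   # column marginal
    mx = sum(i * px[i] for i in range(n))
    my = sum(j * py[j] for j in range(n))
    sx = math.sqrt(sum((i - mx) ** 2 * px[i] for i in range(n)))
    sy = math.sqrt(sum((j - my) ** 2 * py[j] for j in range(n)))

    p_sum = [0.0] * (2 * n - 1)    # distribution of i + j
    p_diff = [0.0] * n             # distribution of |i - j|
    energy = inertia = idm = cov = 0.0
    for i in range(n):
        for j in range(n):
            v = p[i][j]
            p_sum[i + j] += v
            p_diff[abs(i - j)] += v
            energy += v * v
            inertia += (i - j) ** 2 * v
            idm += v / (1 + (i - j) ** 2)
            cov += (i - mx) * (j - my) * v

    return {
        "correlation": cov / (sx * sy) if sx > 0 and sy > 0 else 0.0,
        "energy": energy,
        "entropy": h(v for row in p for v in row),
        "inertia": inertia,
        "inverse_difference_moment": idm,
        "sum_average": sum(k * v for k, v in enumerate(p_sum)),
        "sum_entropy": h(p_sum),
        "difference_entropy": h(p_diff),
    }
```

For a matrix with all of its probability on the diagonal (every pixel pair has equal gray levels), this sketch gives a correlation of 1 and an inertia of 0, which agrees with the qualitative descriptions above.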

Each texture feature was calculated at θ = 0°, 45°, 90°, and 135° for
specified distances and/or scales. Since it is expected that the shape
and the texture of masses in the ROIs do not have angular preferences,
we averaged the features at θ = 0° and 90°, and at θ = 45° and 135°,
and referred to these averaged features as features at 0° and 45°,
respectively, in the following discussion. For a given pixel distance,
the actual distance between the pixels on the image at 45° was equal to
√2 times the actual distance at 0°. As the pixel distance increased,
the differences in the actual distances between these angles became
more significant. Because the texture features depended on the actual
distance between the pixel pairs, the features at the two angles were
treated separately in our multiresolution texture analysis.
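
The relation between pixel distance, direction, and actual distance can be made concrete with a small sketch. The mapping of angles to row and column steps and all names below are assumptions chosen for illustration; only the pairing of 0° with 90° and of 45° with 135° comes from the text.

```python
import math

# Assumed (row step, column step) per unit pixel distance for each direction.
DIRECTION_STEPS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}


def pixel_offset(d, angle):
    """Row and column offset for pixel distance d in the given direction."""
    dy, dx = DIRECTION_STEPS[angle]
    return d * dy, d * dx


def actual_distance(d, angle):
    """Euclidean distance, in pixel widths, between the two pixels of a pair."""
    dy, dx = pixel_offset(d, angle)
    return math.hypot(dy, dx)


def average_pairs(feature_by_angle):
    """Average 0/90 and 45/135 values into the two angles used in the text."""
    return {
        0: (feature_by_angle[0] + feature_by_angle[90]) / 2,
        45: (feature_by_angle[45] + feature_by_angle[135]) / 2,
    }
```

At a pixel distance of 16, for example, the pair spacing is 16 pixel widths at 0° but about 22.6 pixel widths at 45°, which is why the two averaged angles are kept as separate features.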

C.  Wavelet  transform 

The wavelet transform produces a multiscale representation of an image
in which the geometric structures of the image are preserved within
each sub-band or level. In Appendix A, we present a brief introduction
to the wavelet transform. More details of the theory and applications
can be found in the literature.

Mallat presented a multiresolution framework with the discrete wavelet
transform inherently embedded. In this framework, the original image
(Y) that has the highest resolution is referred to as level 0 (l = 0)
or scale 1 (s = 2^l with l = 0). At scale 2, the original image is
decomposed in the wavelet transform domain (similar to the spatial
frequency domain in Fourier transform) into a low-pass sub-band image
(referred to as the approximation image at level 1 or scale 2, in the
low-pass low-pass quadrant) and three bandpass sub-band images
(referred to as detail images, in the low-pass high-pass, high-pass
low-pass, and high-pass high-pass quadrants). At the next scale (scale
4), the approximation image at scale 2 is decomposed further into a
low-pass sub-band approximation image and three more bandpass sub-band
images. The decomposition can be stopped at some desired (lower)
resolution or (larger) scale. Figures 3(a) and 3(b) illustrate the
wavelet decomposition to level 4 or scale 16 of the ROIs containing a
mass and normal parenchyma, respectively. The reconstruction of an
image from the wavelet coefficients in the transform domain starts from
the lowest resolution (largest scale) sub-band images.

Fig. 3. Wavelet decomposition from level 0 (L0 or scale 1) to level 4 (L4 or
scale 16) of (a) an ROI with a mass and (b) an ROI with normal breast
tissue.

In this study, Daubechies' filter with four coefficients
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was  used  as  the  wavelet  filter  for  image  decomposition  (see 
Appendix A). A filter with a small number of wavelet coefficients was
chosen because the width of the uncertainty band at the image boundary
caused by convolution would be narrower. This allowed the decomposition
to be performed to larger scales while still providing a sufficient
number of usable pixels in the approximation image for the construction
of an SGLD matrix. The chosen filter was also separable so that the
fast wavelet transform algorithm could be employed in two-dimensional
image analysis.
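
A single level of a separable two-dimensional decomposition with the four-coefficient Daubechies filter can be sketched as follows. The filter coefficients are the standard ones for this wavelet; everything else is an assumption for illustration, in particular the periodic boundary extension (the study instead discards an uncertainty band at the image boundary), the ordering of the three detail images, and the function names.

```python
import math

S3 = math.sqrt(3.0)
NORM = 4.0 * math.sqrt(2.0)
# Four-coefficient Daubechies low-pass filter and its quadrature mirror.
LOW = [(1 + S3) / NORM, (3 + S3) / NORM, (3 - S3) / NORM, (1 - S3) / NORM]
HIGH = [LOW[3], -LOW[2], LOW[1], -LOW[0]]


def analyze_1d(signal, taps):
    """Filter with `taps` and keep every second sample (periodic extension)."""
    n = len(signal)
    return [sum(t * signal[(2 * k + m) % n] for m, t in enumerate(taps))
            for k in range(n // 2)]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def along_columns(matrix, taps):
    """Apply the 1-D analysis step to every column of `matrix`."""
    return transpose([analyze_1d(col, taps) for col in transpose(matrix)])


def decompose_2d(image):
    """One level: returns (approximation image, tuple of three detail images)."""
    lo_rows = [analyze_1d(row, LOW) for row in image]
    hi_rows = [analyze_1d(row, HIGH) for row in image]
    approx = along_columns(lo_rows, LOW)
    details = (along_columns(lo_rows, HIGH),
               along_columns(hi_rows, LOW),
               along_columns(hi_rows, HIGH))
    return approx, details
```

Because each level halves both image dimensions, applying `decompose_2d` repeatedly to the approximation image takes a 256 × 256 ROI to 128 × 128, 64 × 64, 32 × 32, and 16 × 16 pixels at scales 2, 4, 8, and 16.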

D. Multiresolution texture feature space

We used the original image Y (scale 1) and the low-pass sub-band
approximation images (scale 2^l, l = 1, ..., 4) to formulate SGLD
matrices at multiple scales. The distance of the pixel pairs used at
each scale was one pixel. The decomposition stopped at scale 16 so that
the approximation image in the transform domain had 16 × 16 pixels.
Effectively, the pixel distances of SGLD matrices formulated in this
way at scales of 1, 2, 4, 8, and 16 corresponded to pixel distances of
1, 2, 4, 8, and 16 in the original image. A total of 80 features were
calculated from each ROI (8 features × 2 angles × 5 levels) in this
feature space. These 80-dimensional feature vectors based on the
wavelet transform were denoted as F_WT.

As the scale in the wavelet transform increased, the statistical
fluctuations in the SGLD matrices based on the smaller and smaller
images could not be neglected due to the random sample errors. To
reduce the statistical error in the SGLD matrices, we decomposed the
original ROIs by wavelet transform to scale 4 so that the smallest
image size was 64 × 64 pixels. Then the wavelet filter was applied once
more without downward sampling. The resulting wavelet coefficients were
obtained at scale 8 and were overcomplete and redundant. However, this
allowed the number of pixels used to construct the SGLD matrices to be
kept at 64 × 64. The SGLD matrices at scale 8 were then constructed
with distances of 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, and 12. These
distances between pixel pairs were equivalent to the distances of 8,
12, 16, 20, 24, 28, 32, 36, 40, 44, and 48 in the original image.
Therefore, a total of 224 features were calculated from each ROI
[8 features × 2 angles × (1 pixel distance at scales 1, 2, 4 + 11 pixel
distances at scale 8)] in this feature space. The feature vectors in
this 224-dimensional feature space were based on wavelet transform and
variable distances, and were denoted as F_WV.

To evaluate the effect of the wavelet transform on the classification
results, we compared the features described above to those extracted
from the SGLD matrices of the original image. The SGLD matrices were
constructed with pixel distances of 1, 2, 4, 8, 12, 16, 20, 24, 28, 32,
36, 40, 44, and 48. These distances corresponded to those used in the
calculation of F_WV when the latter were converted to equivalent pixel
distances in the original image. Therefore, a total of 224 features
were calculated from each ROI (8 features × 2 angles × 14 pixel
distances) in this feature space. These feature vectors based on SGLD
matrices from the original images with variable distances were denoted
as F_VD. From the 224 features, we could also select a subset of 80
features at d = 1, 2, 4, 8, and 16. The pixel distances in this subset
corresponded to the pixel distances used for the calculation of the
features in F_WT. The 80-dimensional feature vectors obtained from this
subset of features with variable distances were denoted as F_VDS. The
impact of the wavelet transform on the discriminant power of the
texture features was studied by comparing the classification results
obtained with F_VD and F_VDS to those obtained with F_WV and F_WT,
respectively.
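
The dimensions of the four feature spaces and the equivalence of the pixel distances can be checked with a few lines of arithmetic. The variable and function names below are illustrative only; the numbers are those given in the text.

```python
N_FEATURES, N_ANGLES = 8, 2

# F_WT: one-pixel distance on the original image and on the approximation
# images, so the equivalent distance in the original image equals the scale.
wt_equivalent = [1 * scale for scale in (1, 2, 4, 8, 16)]

# F_WV: one-pixel distance at scales 1, 2, 4, then distances 2..12 on the
# 64 x 64 scale-8 image, whose pixel spacing is 4 original pixels.
wv_equivalent = [1, 2, 4] + [4 * d for d in range(2, 13)]

# F_VD uses the original image only; F_VDS keeps the distances shared with F_WT.
vd_distances = [1, 2, 4] + list(range(8, 49, 4))
vds_distances = [d for d in vd_distances if d in wt_equivalent]


def dimension(distances):
    """Number of features: 8 features x 2 angles x number of distances."""
    return N_FEATURES * N_ANGLES * len(distances)


for name, dist in [("F_WT", wt_equivalent), ("F_WV", wv_equivalent),
                   ("F_VD", vd_distances), ("F_VDS", vds_distances)]:
    print(name, dimension(dist), dist)
```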

E.  Linear  discriminant  analysis 

Linear discriminant analysis is a systematic statistical technique to
classify individuals or cases into one of the mutually exclusive
classes based on certain indices or predictor variables. These indices
or predictor variables may have certain correlations with one another.
In a two-class classification problem, for example, a linear
combination of these variables is formed and the coefficients are
determined based on certain optimization criteria. One such criterion,
proposed by Fisher, is that the ratio of the difference of the means of
the linear combination in the two classes to its variance is maximized.
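
Fisher's criterion can be written as J(w) = [w·(m_a - m_b)]^2 / (w^T S_W w), where m_a and m_b are the class means and S_W is the within-class scatter matrix; it is maximized by weights proportional to S_W^{-1}(m_a - m_b). The sketch below applies this closed-form solution to two features. It is a textbook illustration under these assumptions, not the SPSS procedure used in the study, and all function names and toy values are hypothetical.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(len(rows[0]))]


def pooled_scatter(class_a, class_b):
    """Within-class scatter matrix S_W summed over both classes."""
    dim = len(class_a[0])
    s = [[0.0] * dim for _ in range(dim)]
    for rows in (class_a, class_b):
        m = mean_vector(rows)
        for r in rows:
            for i in range(dim):
                for j in range(dim):
                    s[i][j] += (r[i] - m[i]) * (r[j] - m[j])
    return s


def fisher_weights_2d(class_a, class_b):
    """Weights w = S_W^{-1} (m_a - m_b) for exactly two features."""
    (a, b), (c, d) = pooled_scatter(class_a, class_b)
    det = a * d - b * c
    ma, mb = mean_vector(class_a), mean_vector(class_b)
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    # Explicit inverse of a 2 x 2 matrix applied to the mean difference.
    return [(d * diff[0] - b * diff[1]) / det,
            (-c * diff[0] + a * diff[1]) / det]


def score(w, x):
    """Discriminant score: the linear combination of the feature values."""
    return sum(wi * xi for wi, xi in zip(w, x))


# Toy feature values for two classes (made up for illustration).
masses = [[1, 2], [2, 3], [3, 3], [4, 5]]
normals = [[1, 0], [2, 1], [3, 2], [4, 2]]
w = fisher_weights_2d(masses, normals)
```

In this toy example neither feature separates the classes on its own margin as well as the combination does: the first feature has identical class means, yet it receives a nonzero weight because it is correlated with the second.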

The discriminant analysis in the SPSS software package [M. J. Norusis,
SPSS for Windows Professional Statistics, Release 6.0 (SPSS Inc.,
Chicago, IL, 1993)] was used in this study. The extended feature spaces
as explained above were each used as a pool of predictor variable
candidates for a two-class discriminant analysis that contained a mass
class and a normal tissue class. Similar to the situation of multiple
linear regression, including a large number of possible predictor
variables in the linear model of the discriminant function is not a
good strategy. Inclusion of irrelevant variables will not improve the
classification accuracy and will decrease the generalization capability
of the classifier. Because of the large number of features in the
pools, it is a formidable task to test all different feature
combinations at different numbers of feature variables to find the best
combination. Therefore, we utilized a stepwise feature selection
procedure to select predictor variables in each feature space. Five
selection criteria are provided in the SPSS package, including (1) the
minimization of Wilks' lambda, (2) the minimization of unexplained
variance, (3) the maximization of the between-class F statistic, (4)
the maximization of Mahalanobis distance, and (5) the maximization of
Lawley-Hotelling trace (Rao's V). For each feature space, we tested all
available selection criteria. With each criterion, we performed
stepwise feature selection on all the 168 cases using the program
default values for the inclusion and exclusion threshold parameters and
the termination criterion. The selection criterion that provided the
best classification result would be chosen. Since the program default
values of the parameters might not be the optimal choices for our
application, we varied the parameter values of the chosen criterion in
an attempt to further improve the classification results. For our data
sets, when the thresholds were set higher than the default values,
fewer feature variables would be included and the classification
accuracy decreased. When the thresholds were set lower than the default
values, more features would be included and the classification results
might improve. However, when the thresholds were lowered further and too
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many  features  were  included,  the  classification  would  dete¬ 
riorate. The set of feature variables that provided the best
classification in this selection process was used for the formulation
of the discriminant function in the given feature space. For
simplicity, we will refer to this stepwise selection procedure with
different thresholds as a stepwise or automatic selection process. Our
feature selection process was by no means exhaustive. However, it would
represent the best selection achievable within reasonable computational
requirements.
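
The effect of the inclusion threshold can be illustrated with a simplified forward selection that uses the squared Mahalanobis distance between the class means as its criterion. This is a sketch under stated assumptions and is not the SPSS stepwise procedure: it never removes a variable once entered, it uses a plain gain threshold (`min_gain`, a hypothetical parameter) instead of F-to-enter and F-to-remove values, and it assumes the selected features are not collinear.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def mahalanobis_sq(class_a, class_b, features):
    """Squared Mahalanobis distance between class means on a feature subset."""
    def mean(rows, k):
        return sum(r[k] for r in rows) / len(rows)

    diff = [mean(class_a, k) - mean(class_b, k) for k in features]
    dof = len(class_a) + len(class_b) - 2
    cov = [[0.0] * len(features) for _ in features]   # pooled covariance
    for rows in (class_a, class_b):
        means = [mean(rows, k) for k in features]
        for r in rows:
            dev = [r[k] - mu for k, mu in zip(features, means)]
            for i, di in enumerate(dev):
                for j, dj in enumerate(dev):
                    cov[i][j] += di * dj / dof
    return sum(d * x for d, x in zip(diff, solve(cov, diff)))


def forward_select(class_a, class_b, n_features, min_gain):
    """Add the feature with the largest gain until the gain drops below min_gain."""
    selected, best = [], 0.0
    while len(selected) < n_features:
        candidates = [k for k in range(n_features) if k not in selected]
        value, k = max((mahalanobis_sq(class_a, class_b, selected + [k]), k)
                       for k in candidates)
        if value - best < min_gain:
            break
        selected.append(k)
        best = value
    return selected
```

Lowering `min_gain` admits more features, which mirrors the behavior described above: the training result can only improve as features are added, while the ability to generalize may suffer once too many are included.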

To evaluate the capability of generalization of a trained classifier,
we randomly divided the 168 cases into two groups (G1 and G2) of equal
size. We used the features selected with the procedure described above
as discriminant variables. If a given group was used for training, the
feature values of each case from that group were used to optimize the
coefficients of the linear discriminant function. The training cases
were then classified with the linear discriminant function as a
verification of consistency. The other group was used as test cases,
for which the feature values were input to the classifier and the
discriminant score of each case was calculated from the linear
discriminant function. Each of the two groups was alternately used as
the training group so that the variability of the classifier with
different training groups could be observed.
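
The alternating use of the two groups can be sketched as follows. The callables `train` and `evaluate` are hypothetical placeholders for fitting the discriminant function and for computing a performance index; the fixed random seed is an assumption added only to make the example reproducible.

```python
import random


def split_cases(case_ids, seed=0):
    """Randomly divide the cases into two groups of equal size."""
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]


def alternate_training(g1, g2, train, evaluate):
    """Train on each group in turn and test on the other group."""
    results = []
    for train_ids, test_ids in ((g1, g2), (g2, g1)):
        model = train(train_ids)
        results.append({
            "train": evaluate(model, train_ids),   # verification of consistency
            "test": evaluate(model, test_ids),     # generalization
        })
    return results


g1, g2 = split_cases(range(168))
```

Splitting by case rather than by ROI keeps the four ROIs taken from one mammogram in the same group.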

Receiver operating characteristic (ROC) analysis was used to evaluate
the overall performance of the linear discriminant functions, in
addition to the classification results reported by the SPSS program
under certain prior probability assumptions. For a two-class problem,
the ROC curve could be obtained using Bayes' rule by changing the prior
probability. Alternatively, the discriminant score from the canonical
discriminant function could be used as the decision variable in the ROC
analysis. Figure 4 demonstrates such a distribution of discriminant
scores based on the linear combination of features calculated from
wavelet coefficients at variable distances. The distribution of the
discriminant scores of the ROIs in the training or the test group was
input into

Table I. Texture features selected by stepwise discriminant analysis.
(a) From F_WT and F_VDS
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(b)  From  Fvn  and  Fwv 
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(a)  •  13  features  ( automatic iselectcd  from  Fv^.  V  5  features  (automatic)  selected  from  Fyos-  Note:  0°  represents  the  average  of  features  at  0°  and  90°:  45° 
represents  the  average  at  45°  and  135°. 

(b) ■ 19 features (automatic) from F_WV. G 29 features (semiautomatic) from F_WV. A 20 features (automatic) from F_VD. Note: Some distances/angles are not
shown if no feature was selected. 0° represents the average of features at 0° and 90°; 45° represents the average at 45° and 135°.
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DISCRIMINANT  SCORE 

Fig. 4. An example of the probability density distribution of the
discriminant scores of the masses and normal tissue. The discriminant
scores were calculated from the canonical discriminant function that
was optimized with all 672 ROIs with 19 features selected from
multiresolution texture feature space F_WV.

the LABROC1 program, which provided a maximum-likelihood estimation of
a binormal ROC curve for training or testing, respectively. The area
under the fitted ROC curve, A_z, was used as a performance index for
evaluating the different sets of features selected from different
multiresolution feature pools. The standard deviation (SD) of A_z
estimated by LABROC1 was also reported. The CLABROC program was
employed to test the statistical significance of the difference between
A_z values of different sets of selected features. The two-tailed p
values were reported in the following comparisons.
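
For orientation, the area under an ROC curve can be approximated directly from the discriminant scores, and the area under a binormal curve follows from its two parameters. The sketch below shows both; it does not reproduce the maximum-likelihood fit performed by LABROC1. The empirical estimate is the nonparametric (Mann-Whitney) area, which generally differs somewhat from a fitted A_z, and the parameter convention for `a` and `b` (separation and spread ratio of the two score distributions) is the usual binormal one, stated here as an assumption.

```python
import math


def empirical_auc(mass_scores, normal_scores):
    """Fraction of (mass, normal) pairs ranked correctly; ties count one half."""
    wins = 0.0
    for m in mass_scores:
        for n in normal_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(mass_scores) * len(normal_scores))


def binormal_az(a, b):
    """Area under a binormal ROC curve: Phi(a / sqrt(1 + b^2))."""
    z = a / math.sqrt(1.0 + b * b)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

A value of 0.5 corresponds to chance performance and a value of 1.0 to perfect separation of the two classes.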

III. RESULTS

A.  Texture  features  based  on  wavelet  coefficients 

Stepwise feature selection was performed with the multiresolution
texture features extracted from the feature space F_WT. Thirteen
features were selected as shown in Table I(a). The A_z and the
estimated SD of the ROC curves are summarized in Table II. Figure 5
shows the ROC curves for the classification using the features derived
from the wavelet coefficients. The A_z values of 0.858 and 0.854 for
testing of G1 and G2, respectively, are higher than those of
0.817±0.027 (p=0m) and 0.829±0.026 (p = 0.10) obtained with texture
features calculated from the SGLD matrix at a single distance of 20
pixels.

Fig. 5. ROC curves for classifying masses from normal tissue with
discriminant function based on 13 features selected from texture
feature space F_WT.

B.  Texture  features  based  on  original  images  with 
variable  distances 

To evaluate whether the improvement of classification over the results
using features based on a single distance is caused by the low-pass
filtering in the wavelet transform or by the changes in the pixel
distances, we used the same 13 feature variables selected from F_WT,
but the feature values were calculated from the SGLD matrices based on
the origi-


Table II. Comparison of the area under the ROC curves, A_z, obtained from different feature spaces.

Number of features | Feature space | Features extracted from scales | Training on G1 and G2: A_z (Train) | Training on G1: A_z (Train) | Testing on G2: A_z (Test) | Training on G2: A_z (Train) | Testing on G1: A_z (Test)
13 (a) | F_WT  | 1, 2, 4, 8, 16 | 0.864±0.016 | 0.869±0.021 | 0.854±0.023 | 0.868±0.022 | 0.858±0.022
13 (b) | F_VDS | 1              | 0.796±0.019 | 0.808±0.026 | 0.781±0.027 | 0.798±0.027 | 0.787±0.027
5 (a)  | F_VDS | 1              | 0.758±0.021 | 0.766±0.028 | 0.747±0.029 | 0.754±0.029 | 0.760±0.028
20 (a) | F_VD  | 1              | 0.885±0.014 | 0.834±0.024 | 0.837±0.024 | 0.905±0.018 | 0.857±0.022
19 (c) | F_VD  | 1              | 0.871±0.015 | 0.883±0.019 | 0.836±0.025 | 0.878±0.021 | 0.859±0.022
19 (a) | F_WV  | 1, 2, 4, 8     | 0.884±0.014 | 0.899±0.018 | 0.853±0.025 | 0.887±0.021 | 0.859±0.022
29 (d) | F_WV  | 1, 2, 4, 8     | 0.887±0.014 | 0.904±0.018 | 0.840±0.026 | 0.903±0.018 | 0.855±0.022

(a) Automatic feature selection.
(b) Features corresponding to those automatically selected from F_WT.
(c) Features corresponding to those automatically selected from F_WV.
(d) Semiautomatic feature selection.
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Fig.  6.  ROC  curves  for  classifying  masses  from  normal  tissue  with  discrimi¬ 
nant  function  based  on  19  features  selected  from  texture  feature  space  Fwv 

nal images at equivalent distances, F_VDS. The A_z values of the ROC
curves with the same training and test groups (Table II) are
significantly lower than the corresponding A_z values (p < 0.006) with
features extracted from the wavelet coefficients.

Stepwise feature selection was also performed on the entire data set of
168 cases from the feature space F_VDS. The five features selected are
listed in Table I(a). When this set of features was used to formulate
the discriminant function, there was no improvement in A_z (Table II),
compared with the results using the 13 features with feature values
from the same F_VDS space. The differences between the A_z values
obtained with 13 features from F_WT and the corresponding A_z values
obtained with 5 features from F_VDS are statistically significant
(p < 0.0002). When the entire feature space of F_VD was used in the
stepwise feature selection, 20 features were selected as listed in
Table I(b). The A_z values for classification in both the training and
test groups are significantly higher than those obtained with 5
features from F_VDS (p < 0.025). As can be seen from Table I(b), 12 out
of the 20 features were selected from distances greater than 16 pixels.
This indicates that the information at larger distances, which is not
present in F_VDS, is important in the classification of mass and normal
tissue. Some of the A_z values obtained with these 20 features from
F_VD are higher than those with 13 features from F_WT, while the others
are lower than those obtained from F_WT, with p values ranging from
0.06 to 0.77. This is an indication that the discriminant power of the
features from F_VD is comparable to that of the features from F_WT.

C.  Texture  features  based  on  wavelet  coefficients  at 
variable  distances 

Figure 6 illustrates the ROC curves for training and testing when
stepwise feature selection was performed on the texture features
extracted from the feature space F_WV. The 19 features selected are
listed in Table I(b). As shown by the A_z values in Table II, when the
selected features were used to formulate the discriminant function, the
classification results for the training sets improved in general, with
p values ranging from 0.08 to 0.40, whereas the test results were
almost the same as those obtained with the 13 features from F_WT, with
p values of 0.93 and 0.88. As can be seen from the same table, if these
19 variables were used on the feature values from F_VD, the A_z values
were similar to or slightly lower than those obtained with F_WV. The
differences are statistically significant for A_z (training on G1 and
G2) and for A_z (testing on G2) at p < 0.03, and are insignificant for
the other A_z values, with p values ranging from 0.23 to 0.60.

We also selected the features in two steps, referred to as
semiautomatic selection. First, we input texture features of the same
type, e.g., correlation, calculated at all scales and distances into
the discriminant analysis program. By using the stepwise selection
method with reduced thresholds for the F values for variable entry and
removal, we found the scales and distances that are important for
classification for each texture feature. Then we applied the stepwise
procedure again to all features at their selected scales and distances
to further reduce the number of features. In this way, 29 features were
selected as shown in Table I(b). Although most of them were different
from the 19 features selected automatically, the overall classification
results did not show much difference, indicating that some of the
features used in one discriminant function might be linearly correlated
with some of the features in the other discriminant function. The
classification results (Table II) improved slightly in the training
groups (with p values ranging from 0.08 to 0.74) but deteriorated in
the testing groups (with p values of 0.24 and 0.54), probably because
the increased number of features used in the discriminant function
limited its capability for generalization.

IV.  DISCUSSION 

A.  Multiresolution  texture  analysis 

Textures are generally recognized as being fundamental to perception,
although there is no precise definition or characterization of textures
available in practice. Intuitively, texture descriptors provide
measures of properties such as smoothness, coarseness, and regularity.
When an image is composed of elements of texture primitives, the
description of the image by texture features can be very effective. One
of the advantages is that the texture features are shift invariant and
can be made orientation invariant by averaging over various angles.
This is very important since the location and orientation of the mass
in the ROI can be arbitrary.

The masses found in clinical mammograms have very different shapes and
sizes. It is a challenge to find a universal feature or a set of
features that can differentiate the masses from the normal tissue and
parenchymal structures in the breast. It is also difficult to define a
priori an optimal resolution for the ROIs. A multiresolution approach
could provide a scale-invariant interpretation of an image.

The wavelet transform is closely related to the well-known Fourier
transform through the short-time Fourier transform or Gabor transform.
It is considered a natural way of decomposing the image energy into
different frequency

Wei et al.: Computerized classification of mass and normal tissue
bands through convolution with the translated and dilated version of a
function called the "mother wavelet." Unlike the Fourier transform,
where the coefficients in the transform domain do not reflect the local
spatial variations, the wavelet coefficients retain the spatial
variations of the original image.

In the multiresolution framework using wavelet decomposition
proposed by Mallat, the transform domain contains a minimum set
of coefficients from which the reconstruction of the decomposed
image is perfect or lossless. In the successive image
decomposition, the approximation image in the current scale is
decomposed into an approximation image and three detail images in
the larger scale. Once the mother wavelet is chosen, the
coefficients, which contain one approximation image and a series
of detail images at different scales, are nonredundant and the
transform is one-to-one. The extraction and condensation of image
information through Mallat's framework are very efficient. Thus
the wavelet transform is often used for image compression.

In the classification and pattern recognition problem, however,
the focus is on the extraction of those features that can provide
maximum distinction among different classes rather than on the
minimal representation of the original image. In our current
texture analysis, we used the approximation images at different
scales, which are redundant representations of the original
image. Such representations may be helpful in classification and
pattern recognition applications, as demonstrated by the
improvement in classification accuracy in comparison to the
results obtained with features at a single distance or to the
results obtained with features at variable distances without
wavelet transform.

The discrete wavelet transform can be described as a cascaded
process with two basic operations: filtering and down sampling.
There are certain requirements for a filter to be a wavelet
filter. Although it is possible to find optimal wavelet filters
for certain types of images, our focus in this work is on the
feasibility of multiresolution features for classification of
masses from normal tissue rather than the optimization of this
procedure. Therefore, an orthonormal four-weight Daubechies
filter with compact support was used for our image decomposition.
Because the down-sampling process reduces the image size by a
factor of 2 in each direction as the scale increases, the small
extent of the boundary distortion produced by this short filter
helps retain as much useful image information as possible for
texture calculation.
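
The filtering and down-sampling cascade described above can be sketched in a few lines. This is only an illustration under stated assumptions: the coefficients are the standard four-tap Daubechies low-pass values, the wrap-around boundary handling and the function names (`lowpass_decimate`, `approximation_image`, `approximation_pyramid`) are hypothetical choices made for the example, and the study's actual implementation is not given in the text.

```python
import math

# Standard four-tap Daubechies low-pass (scaling) filter coefficients.
S3 = math.sqrt(3.0)
H = [(1 + S3) / (4 * math.sqrt(2)), (3 + S3) / (4 * math.sqrt(2)),
     (3 - S3) / (4 * math.sqrt(2)), (1 - S3) / (4 * math.sqrt(2))]


def lowpass_decimate(signal):
    """Filter a 1D sequence with H and keep every second output sample.

    Periodic (wrap-around) boundary handling is an assumption of this sketch.
    """
    n = len(signal)
    return [sum(H[k] * signal[(2 * i + k) % n] for k in range(4))
            for i in range(n // 2)]


def approximation_image(image):
    """One decomposition level: filter and decimate rows, then columns."""
    rows = [lowpass_decimate(row) for row in image]
    cols = [lowpass_decimate(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def approximation_pyramid(image, levels):
    """Return the approximation images at successively larger scales."""
    pyramid = []
    current = image
    for _ in range(levels):
        current = approximation_image(current)
        pyramid.append(current)
    return pyramid
```

Each level halves the image size in both directions, so a 256 x 256 ROI would shrink to 8 x 8 after five levels, which is why the short filter length matters for the boundary region.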

B. Comparison of classification accuracy with
features  from  different  feature  spaces 

To compare the discriminant power of texture features calculated
from the wavelet coefficients to those from the original images,
we used the feature variables with equivalent distances (F_VDS).
Using the features selected by the stepwise procedure, the
classification results based on the features from F_WT were
significantly better than those based on the features from F_VDS.
When we used the 13 features automatically selected from F_WT and
formulated the discriminant functions based on the texture
feature values from F_VDS, the classification results
demonstrated similar differences. This indicates that the texture
features at equivalent distances from the wavelet transform
domain have better discriminant power than those from the
original images. However, when texture features up to distances
of 48 (F_VD, corresponding to 4.8 mm for 0° features and 6.79 mm
for 45° features) are available for feature selection, the
discriminant power of the texture features from the original
images can reach as high as that of the features from F_WT or
F_WV. As indicated by the features selected from each space shown
in Tables I(a) and I(b), the texture information at large
distances is important for the classification task. The feature
space F_VDS does not provide such important information,
resulting in poor classification. On the other hand, although the
features in F_WT were calculated at distances equivalent to those
of F_VDS, the low-pass filtering effectively increases the
correlation distances of the features. The structural information
and energy of the original image obtainable at larger distances
than the maximum equivalent distance of 16 pixels are condensed
into the wavelet coefficients used for the calculation of F_WT.
The fact that the features from F_WV do not provide significant
improvement (at least for the test groups) in the classification
results indicates that the compression of image information is
efficiently accomplished by the wavelet transform, so that the
additional information in F_WV is redundant, as expected.

The overall operations of the discrete wavelet transform can be
summarized as bandpass filtering (including low-pass and
high-pass filtering) and downward sampling (decimation). The
approximation images with the wavelet coefficients are the result
of the low-pass filtering from convolution with the orthogonal
scaling function. The detail images obtained through convolution
with the orthogonal wavelet function contain the edge (or
high-frequency) information of the images. The texture features
based on the multiresolution approximation images demonstrate
improvement compared with those based on the original images for
the classification of masses from normal tissue. This seems
logical since, unlike microcalcifications that contain
high-frequency components, the masses usually have relatively
lower frequency contents. The frequency components of the
background normal tissue are also in the low-frequency region,
which makes the differentiation much more difficult. As the scale
increases (by downward sampling or by increasing the distances in
SGLD matrix formulation), the spatial resolution becomes lower
while the low-frequency band becomes narrower. The texture
features based on the wavelet coefficients with decreasing
low-frequency bandwidth demonstrate statistical difference
between masses and normal tissue. At the same time, the effect of
the noise with relatively high frequency is eliminated. The
subtle differences between the masses and the normal tissue in
the low-frequency range are therefore revealed when the
difference in the changes of the low-frequency bands between them
is utilized through multiresolution analysis. This may explain
our finding that classification results with the multiresolution
textures are better than those with single distance textures,
except for the results obtained with features selected from
F_VDS. It may be noted that the maximum distance of 16 pixels
used in F_VDS is lower than the selected distances of 20 pixels
in the single resolution texture analysis.

It is expected that the detail images in the wavelet transform
domain contain valuable information about the difference
between masses and normal tissue. When radiologists
observe  some  large,  suspicious  structure,  they  will  usually 
inspect  it  in  more  detail  to  determine  whether  it  is  a  mass. 
However,  we  found  that  using  the  texture  features  based  on 
the  detail  images  in  the  wavelet  transform  domain  to  formu¬ 
late  the  discriminant  function  did  not  result  in  proper  classi¬ 
fication.  It  seems  that  the  statistical  summary  of  the  textures 
used  here  is  not  effective  for  the  detail  images.  We  will  ex¬ 
plore  the  use  of  other  statistical  features  to  extract  the  infor¬ 
mation  contained  in  the  detail  images  in  future  studies. 

The  reason  that  the  features  from  the  wavelet  transform 
improve  the  classification  results  can  also  be  explained  as  the 
result  of  the  low-pass  filtering  operation.  In  this  sense,  other 
low-pass filters can also be used. This provides the possibility
of designing optimal filters for the masses so that the
classification  results  can  be  further  improved.  An  advantage 
of  the  wavelet  transform  over  other  low-pass  filters  is  that  it 
provides  an  integral  multiresolution  framework  with  great 
computational  efficiency. 

The down-sampling process in the wavelet transform effectively
reduces the number of pixels in the approximation image at each
scale. The reduced size of the approximation images at larger
scales will cause more variability in SGLD matrix formulation,
thereby affecting the accuracy of the textures estimated at lower
image resolution. As the scale increases, the statistical
fluctuations of the SGLD matrix based on the smaller and smaller
images cannot be neglected due to the random sample errors. In
fact, when the approximation images at scale 32 with 8 x 8 pixels
(equivalent to a pixel distance of 32) were used, the texture
features did not show any differences between ROIs containing
mass and ROIs containing normal tissue due to the small number of
pixel pairs for the SGLD matrix formulation. To improve the
statistical accuracy of the SGLD matrices, we used the
information contained in the decimated coefficients in the
wavelet transform and increased the number of discrete distances
at which the SGLD matrices, and thereby the texture features in
F_WV, could be calculated. Equivalently, this implies that
features based on the information in the low-frequency bands with
different bandwidths are used for classification. Although this
did not significantly improve the classification results for the
current data set, the features from F_WV are expected to be
statistically superior to those from F_WT because of the reduced
uncertainties in the SGLD matrices.
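
The dependence of the SGLD statistics on the number of available pixel pairs can be illustrated with a small sketch. It builds the joint probability matrix for one displacement and evaluates the energy and correlation features defined in Appendix B. The symmetric pair counting and the function names (`sgld_matrix`, `energy`, `correlation`) are assumptions made for this example and are not taken from the study's software.

```python
import math


def sgld_matrix(image, dx, dy, levels):
    """Joint probability p(i, j) of gray-level pairs displaced by (dx, dy).

    Each pair is counted in both orders (symmetric matrix), which is an
    assumption of this sketch. Returns the matrix and the number of pairs.
    """
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    pairs = 0
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                a, b = image[r][c], image[r2][c2]
                counts[a][b] += 1
                counts[b][a] += 1
                pairs += 1
    # Assumes at least one pair exists for the requested displacement.
    p = [[v / (2 * pairs) for v in row] for row in counts]
    return p, pairs


def energy(p):
    """Sum of squared matrix elements."""
    return sum(v * v for row in p for v in row)


def correlation(p):
    """Correlation of the gray levels; assumes nonzero marginal variances."""
    n = len(p)
    px = [sum(p[i]) for i in range(n)]
    py = [sum(p[i][j] for i in range(n)) for j in range(n)]
    mx = sum(i * px[i] for i in range(n))
    my = sum(j * py[j] for j in range(n))
    sx = math.sqrt(sum((i - mx) ** 2 * px[i] for i in range(n)))
    sy = math.sqrt(sum((j - my) ** 2 * py[j] for j in range(n)))
    cov = sum((i - mx) * (j - my) * p[i][j]
              for i in range(n) for j in range(n))
    return cov / (sx * sy)
```

For a fixed image size the pair count falls as the distance grows, and it falls faster when the image itself shrinks with scale, which is the source of the statistical fluctuation discussed above.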

C.  Linear  discriminant  analysis 

The classification accuracy is dependent on the feature variables
in the linear discriminant function. We observed that when more
features were used for the discriminant functions, there was a
trend that the training results would improve at the expense of
the test results. This is probably because the classifier has too
many unknown parameters and is tuned toward the training group
when it contains a small number of cases. The resulting
discriminant function may not be representative for the general
population. Therefore, the generalization capability of the
classifier may deteriorate as the number of features used in the
linear discriminant function increases. A similar situation
arises when other classifiers, e.g., neural networks, are used.
We also observed that the feature variables selected by the
stepwise discriminant analysis were dependent on the case samples
in the training set. If we used the training subgroups to select
feature variables, the feature variables selected from G1 were
not identical to those from G2. Therefore, we used the whole data
set (G1 and G2) to select the feature variables. As the number of
case samples increased in the data set for feature selection, the
statistical uncertainty of the distributions of the vectors in
the feature space was reduced. This is expected to improve the
robustness of the selected feature variables.

D. CAD application

One of our goals in the development of CAD methods in mammography
is to assist radiologists in detection of suspicious masses on
mammograms using computer vision techniques. Before the automated
ROI detection method is fully developed, we used manually
extracted ROIs to study the feasibility of using texture features
for the classification of mass and normal tissue in different
types of breast parenchyma. The results of this study
demonstrated the potential of using multiresolution texture
features for the classification task. The accuracy at an average
A_z of 0.86 for the test sets represents a significant
improvement over a single resolution approach. Although further
improvement in the accuracy is needed before clinical
implementation, the algorithm can be incorporated into an
automated mass detection program as a step to reduce
false-positive ROIs. For example, we can set a decision threshold
on the ROC curves (Fig. 6) at a true-positive fraction of 95% and
a false-positive fraction (FPF) of 55%, thereby eliminating 45%
of the FPs while most of the true masses are retained.
Alternatively, an accurate classification algorithm, once
developed, can also be used independently from an automated
detection algorithm. For example, it can be implemented in a CAD
workstation and used by radiologists interactively to help
differentiate ROIs indicated by the radiologists. The texture
information used by the computer analysis may complement the
human visual perception. The classification accuracy required,
the best operating point on the ROC curve, and the appropriate
approach of CAD implementation that can be most useful to
radiologists are important topics of investigation in the future.
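
Choosing a decision threshold at a required true-positive fraction, as in the example above, can be sketched as follows. The function name `operating_point`, the score lists, and the convention that larger discriminant scores indicate mass are all assumptions made for illustration; no data from the study are used.

```python
def operating_point(mass_scores, normal_scores, target_tpf):
    """Pick the highest threshold whose true-positive fraction meets the target.

    A ROI is called "mass" when its discriminant score is >= the threshold.
    Returns (threshold, tpf, fpf).
    """
    for threshold in sorted(set(mass_scores), reverse=True):
        tpf = sum(s >= threshold for s in mass_scores) / len(mass_scores)
        if tpf >= target_tpf:
            fpf = (sum(s >= threshold for s in normal_scores)
                   / len(normal_scores))
            return threshold, tpf, fpf
    raise ValueError("target_tpf must not exceed 1.0")
```

The fraction of false-positive ROIs removed at the chosen operating point is then 1 - FPF, which is how an FPF of 55% corresponds to a 45% reduction.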

It  is  well  known  that  the  accuracy  of  a  classifier  for  FP 
reduction  depends  on  the  specific  types  of  FPs  generated  in 
the  detection  process,  which  may  vary  with  different  auto¬ 
mated  detection  schemes  or  human  observers.  The  accuracy 
may  also  depend  to  some  extent  on  the  properties  of  the 
image  acquisition  system  used,  such  as  the  amplification 
mode,  dynamic  range,  or  spatial  resolution.  The  coefficients 
in  the  linear  discriminant  function  and  the  selected  feature 
variables  are  expected  to  be  different  when  the  classifier  is 
used  in  conjunction  with  different  detection  programs.  The 
usefulness  of  this  study  lies  in  the  fact  that  we  developed  a 
general  approach  to  the  extraction  of  multiresolution  texture 
features  and  demonstrated  their  effectiveness  in  classification 
of  masses  and  normal  tissue.  When  this  method  is  applied  to 
a  specific  task,  the  classifier  must  be  trained  with  ROIs  rep¬ 
resentative  of  the  population  detected  in  that  process,  using 
the procedures developed in our study as a guide. It is also
important that a much larger number of training samples than
that  used  in  this  feasibility  study  is  used  in  order  to  ensure 
the  generalization  capability  of  the  trained  classifier. 


V. CONCLUSION

In this study, we examined the application of multiresolution
texture features in the classification of masses and normal
breast parenchyma. With linear discriminant analysis, we
demonstrated that multiresolution texture features from the
approximation images in the wavelet coefficients at different
scales, F_WT, provide significant improvement in the
classification accuracy over the features from the original
images at equivalent distances, F_VDS. The features from the
combination of wavelet coefficients and variable distances, F_WV,
can further improve the classification accuracy, although the
improvement falls short of statistical significance. The area
under the ROC curve, A_z, using 19 features from the F_WV feature
space reached an average of 0.89 for training and 0.86 for
testing. The approach developed here can be incorporated into a
CAD procedure which may assist radiologists in the detection of
suspicious lesions on mammograms. While improvement in the
classification accuracy is still necessary for clinical
applications, our results demonstrate the feasibility of using
multiresolution textures for the classification of masses from
normal tissue on digital mammograms.

APPENDIX A: WAVELET TRANSFORM

In the following, we will briefly describe the basic approach of
the wavelet transform that is related to this paper. For
simplicity, the one-dimensional wavelet transform is discussed.
Generalization to two-dimensional space is straightforward.

In the wavelet transform, a signal f(x) is decomposed with a
family of real orthonormal bases ψ_{j,n}(x) obtained through
translation and dilation of a kernel function ψ(x) known as the
mother wavelet:

    \psi_{j,n}(x) = 2^{-j/2} \psi(2^{-j} x - n),    (A1)

where j and n are integers. The wavelet coefficients of the
signal f(x) can be obtained through the decomposition

    C_{j,n} = \int f(x) \psi_{j,n}(x) dx.    (A2)

The signal can be reconstructed from the wavelet coefficients
C_{j,n} and the wavelet bases ψ_{j,n}(x) through the synthesis
formula

    f(x) = \sum_{j,n} C_{j,n} \psi_{j,n}(x).    (A3)

The mother wavelet ψ(x) can be constructed from a scaling
function φ(x), which satisfies the two-scale difference equation
(Refs. 26 and 27)

    \phi(x) = \sqrt{2} \sum_{k} h(k) \phi(2x - k).    (A4)

The wavelet kernel ψ(x) is related to the scaling function via

    \psi(x) = \sqrt{2} \sum_{k} g(k) \phi(2x - k),    (A5)

where

    g(k) = (-1)^{k} h(1 - k).    (A6)

Several conditions have to be met in order for the set of
wavelet functions in Eq. (A1) to be unique, orthonormal, and
have a certain degree of regularity. Different sets of
coefficients satisfying those conditions can be found in the
wavelet literature.

In  the  discrete  wavelet  transform,  fast  recursive  algo¬ 
rithms  for  wavelet  decomposition  have  been  developed.  The 
pyramid  wavelet  algorithm,  which  we  used  for  the  multireso¬ 
lution  image  analysis  in  this  study,  decomposes  the  signal 
into  two  parts  in  the  next,  larger  scale:  an  approximation 
signal  with  the  scaling  function  that  has  low-pass  filter  char¬ 
acteristics,  and  the  detail  signal  with  the  wavelet  function 
that has the bandpass filter characteristics. In our
two-dimensional wavelet transform, we retained the coefficients
that corresponded to the scaling function φ(x) at each scale
for texture analysis.
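
As a small numerical check of the relations in Eqs. (A4) to (A6), the sketch below takes the standard four-tap Daubechies coefficients as h(k), derives g(k) from Eq. (A6), and evaluates the sums that the orthonormality conditions require. The coefficient values are the commonly tabulated ones and are assumed here to match the filter used in the study; the helper names `h`, `g`, and `checks` are hypothetical.

```python
import math

S3 = math.sqrt(3.0)
# h(k) for k = 0..3; all other coefficients are zero.
H = {k: c / (4.0 * math.sqrt(2.0))
     for k, c in enumerate((1 + S3, 3 + S3, 3 - S3, 1 - S3))}


def h(k):
    """Low-pass (scaling) filter coefficient."""
    return H.get(k, 0.0)


def g(k):
    """High-pass (wavelet) filter coefficient from Eq. (A6)."""
    sign = 1.0 if k % 2 == 0 else -1.0
    return sign * h(1 - k)


SUPPORT = range(-4, 6)  # covers every index where h or g is nonzero
checks = {
    "sum_h": sum(h(k) for k in SUPPORT),                 # sqrt(2)
    "norm_h": sum(h(k) ** 2 for k in SUPPORT),           # 1
    "shift_orth": sum(h(k) * h(k + 2) for k in SUPPORT),  # 0
    "sum_g": sum(g(k) for k in SUPPORT),                 # 0
    "cross": sum(h(k) * g(k) for k in SUPPORT),          # 0
}
```

The zero sum of g(k) reflects the bandpass character of the wavelet filter, while the nonzero sum of h(k) reflects the low-pass character of the scaling filter.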


APPENDIX  B:  TEXTURE  FEATURES 

An SGLD matrix element p_{θ,d}(i, j) is the joint probability
of the gray level pairs i and j in a given direction θ separated
by a distance of d pixels. For each ROI, eight features were
derived from its SGLD matrix of a given θ and d:

    energy = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} p^{2}(i,j),

where n is the number of gray levels of the image;

    correlation = \frac{\sum_{i=0}^{n-1} \sum_{j=0}^{n-1} (i - \mu_x)(j - \mu_y) p(i,j)}{\sigma_x \sigma_y},

where

    \mu_x = \sum_{i=0}^{n-1} i \sum_{j=0}^{n-1} p(i,j),
    \mu_y = \sum_{j=0}^{n-1} j \sum_{i=0}^{n-1} p(i,j),
    \sigma_x^2 = \sum_{i=0}^{n-1} (i - \mu_x)^2 \sum_{j=0}^{n-1} p(i,j),
    \sigma_y^2 = \sum_{j=0}^{n-1} (j - \mu_y)^2 \sum_{i=0}^{n-1} p(i,j)

are the mean and variance of the marginal distributions p_x(i)
and p_y(j), respectively;

morphological feature classifier and object splitting algorithm
used  in  the  segmentation  method.  The  image  database  and  the 
complete two-stage DWCE segmentation method are outlined
in Section III. Finally, Sections IV and V contain the DWCE
segmentation  results  along  with  a  discussion  of  the  advantages 
and  limitations  of  the  method. 

II. Density-Weighted Contrast
Enhancement Segmentation

We  have  developed  a  new  algorithm  using  DWCE  filtering 
with Laplacian-Gaussian (LG) edge detection for segmentation
of low contrast objects in digital mammograms. The
DWCE  algorithm  is  used  to  enhance  objects  in  the  original 
image  so  that  a  simple  edge  detector  can  define  the  object 
boundaries.  Once  the  object  borders  are  known,  morphological 
features  are  extracted  from  each  object  and  used  by  a  classi¬ 
fication  algorithm  to  differentiate  mass  and  nonmass  regions 
within  the  image. 

A.  DWCE  Preprocessing  Filter 

Edge  detection  applied  to  the  original  digitized  mammo¬ 
grams  has  not  proven  effective  in  detecting  breast  masses 
because  of  the  low  signal-to-noise  ratio  of  the  edges  and 
the  presence  of  complicated  structured  background.  Fig.  1(a) 
shows  a  typical  mammogram  from  our  image  database.  It 
contains  a  single  breast  mass  indicated  by  the  arrow.  This 
mammogram  also  contains  dense  fibroglandular  tissue  in  the 
breast  parenchyma.  Although  the  mass  is  relatively  obvious, 
the  partially  overlapping  tissue  makes  the  detection  process 
difficult.  In  order  to  detect  masses  of  varying  shapes  and 
intensities,  we  propose  using  an  adaptive  filtering  technique 
to  suppress  the  background  structures  and  enhance  any  po¬ 
tential  signals.  Fig.  1(b)  shows  the  preprocessed  mammogram 
of  Fig.  1(a).  The  background  in  this  image  is  substantially 
reduced  by  the  proposed  adaptive  filter,  allowing  object  local¬ 
ization  by  a  simple  edge  detector. 

The block diagram for the DWCE preprocessing filter is
depicted in Fig. 2(a). It is an expansion of the local contrast
and mean adaptive filter of Peli and Lim [19] designed for
enhancing images degraded by cloud cover. The original
image, F(x,y), is initially passed through the map rescaler
shown in Fig. 2(b). The rescaling first determines an estimate
for the breast boundary, i.e., the breast map, F_map(x,y).
This is accomplished by rescaling F(x,y) between 0.0 and
1.0 based on the maximum and minimum values within the
whole image and then applying a single threshold. In our case,
any image intensity value greater than or equal to 0.4 was
considered part of the initial breast map estimate. All isolated
objects were then identified using the Laplacian-Gaussian
method described later in Section II-B. The region within the
largest-area object was then selected as the final breast map.

Fig. 1(c) depicts the detected breast map for the
mammogram of Fig. 1(a). Using this breast map, the pixel
values within the original image, F(x,y), are again rescaled
between 0.0 and 1.0. The rescaling range is now determined
from the histogram of pixel values within the breast region.
The pixel values defining the maximum and minimum of the
rescaling range are set to be the maximum and minimum
values containing at least 5% of the total pixel counts. This
prevents outlying pixel values from skewing the rescaling.

Fig. 1. (a) A typical mammogram from our image database, (b) the
corresponding DWCE filtered image, and (c) the breast map defined by and
used in the map rescaling.
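
The first stage of the map rescaling (global rescaling, thresholding at 0.4, and keeping the largest object) might look like the sketch below. It is an illustration only: the 4-connected flood fill is a simple stand-in for the Laplacian-Gaussian object identification used in the paper, the histogram-based second rescaling is not reproduced, and the function names `rescale`, `initial_breast_map`, and `largest_region` are hypothetical.

```python
def rescale(image, lo, hi):
    """Linearly map [lo, hi] to [0.0, 1.0], clipping values outside.

    Assumes hi > lo (a constant image would need special handling).
    """
    span = hi - lo
    return [[min(1.0, max(0.0, (v - lo) / span)) for v in row]
            for row in image]


def initial_breast_map(image, threshold=0.4):
    """Rescale by the global min and max, then apply a single threshold."""
    lo = min(min(row) for row in image)
    hi = max(max(row) for row in image)
    return [[v >= threshold for v in row] for row in rescale(image, lo, hi)]


def largest_region(mask):
    """Keep only the largest 4-connected True region of a boolean mask."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    best = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                stack, region = [(r, c)], []
                seen[r][c] = True
                while stack:
                    y, x = stack.pop()
                    region.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x),
                                   (y, x - 1), (y, x + 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                if len(region) > len(best):
                    best = region
    out = [[False] * cols for _ in range(rows)]
    for y, x in best:
        out[y][x] = True
    return out
```

A typical use would be `breast_map = largest_region(initial_breast_map(image))`, after which the pixel statistics inside the map can drive the second rescaling.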

Fig. 2. (a) The block diagram of the DWCE preprocessing filter used for
image enhancement and (b) the block diagram for the map rescaling.

The map rescaling produces a normalized image, F_N(x,y),
and allows a single set of filter parameters to be applied to
all images in the set. F_N(x,y) is next split into a density
and a contrast image, F_D(x,y) and F_C(x,y), respectively.
The density image is produced by filtering F_N(x,y) with
some type of low-pass filter (e.g., local averaging, Gaussian
smoothing, or median filtering). In the current DWCE filter
implementation, zero-mean Gaussian smoothing with standard
deviation σ_D is used. F_D(x,y) thus directly correlates to a
weighted average of the local optical density of the original
film. The contrast image, F_C(x,y), is also created by filtering
F_N(x,y), but the low-pass filter is replaced with a bandpass or
high-pass filter. In the current version of the DWCE, F_C(x,y)
is created by simply subtracting a Gaussian smoothed version
of F_N(x,y) from itself

    F_{C}(x,y) = F_{N}(x,y) - (G(x,y) * F_{N}(x,y))    (1)

where G(x,y) is a Gaussian smoothing filter with zero mean
and standard deviation σ_C. This complement filter is selected
to allow the entire frequency information of the normalized
image to be used when σ_D = σ_C. Each pixel in the density
image is then used to define a multiplication factor which
modifies the corresponding pixel in the contrast image

    F_{KC}(x,y) = K_{M}(F_{D}(x,y)) \times F_{C}(x,y).    (2)

This  is  the  essence  of  the  DWCE  algorithm.  It  allows 
the  local  density  value  of  each  pixel  to  weight  its  local 
contrast. Fig. 3(a)-(c) show the density, contrast, and weighted
contrast  images,  respectively,  for  the  digitized  mammogram 
of  Fig.  1(a).  The  weighted  contrast  was  created  using  the 
multiplication  function  shown  in  Fig.  4(a).  Note  that  the 
DWCE  filter  substantially  reduces  the  background  and  noise 
while  retaining  the  significant  breast  structures. 

The output of the DWCE filter is a nonlinear rescaled
version of the weighted contrast image. Each pixel in the
weighted contrast image is used to define a second
multiplication value, K_NL(F_KC(x,y)). The multiplication values
are then multiplied by the weighted contrast of the corresponding
pixels

    F_{E}(x,y) = K_{NL}(F_{KC}(x,y)) \times F_{KC}(x,y).    (3)

This produces the final filtered image, F_E(x,y). Fig. 4(b)
shows the nonlinear function, K_NL(·), used in the current
DWCE implementation and Fig. 1(b) again shows the final
enhanced image produced by this single DWCE filter stage.

Fig. 3. (a) The density, F_D(x,y), (b) contrast, F_C(x,y), and (c)
weighted contrast, F_KC(x,y), images produced by the DWCE filter applied
to the mammogram of Fig. 1(a) along with (d) the corresponding LG object
image.

The specific shapes for K_M and K_NL in Fig. 4 were
determined experimentally by observing their effects on the
detection. The shape of K_M(·) was selected to reinforce
(i.e., K_M(·) > 1.0) the contrast at pixels in F_D(x,y) with
medium to high intensity while reducing (i.e., K_M(·) < 1.0)
the contrast of low intensity pixels in the density image. The
rationale is that only background is generally contained in
the low intensity portion of F_D(x,y) while masses and other
breast structures will be seen at higher intensity values. Note
that K_M(0.25) = 1.0, so 75% of the intensity range will see
contrast enhancement. This contrast multiplication function
worked well in enhancing breast structures, but did not provide
adequate separation between the structures. In order to isolate
more of these structures and to help equalize the contrast
across individual objects, a nonlinear rescaling was used.
From Fig. 4(b) we can see that the very low contrasts are
strongly deemphasized while the highest contrast range is
slightly reduced. This rescaling sharpens the object borders by
eliminating many of the low contrast edges that cause region
merging and reduces the effect of extremely large contrasts on
the edge detection.
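
The three filter stages of Eqs. (1) to (3) can be put together in a compact sketch. The multiplication functions here are hypothetical stand-ins: `k_m` is a straight line chosen only so that it equals 1.0 at a density of 0.25, and `k_nl` is a hard cutoff for very low contrasts; the real curves are those plotted in Fig. 4 and are not reproduced. The edge-replicating boundary handling and the three-sigma kernel radius are likewise assumptions.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1D Gaussian weights truncated at about three sigma."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_1d(values, kernel):
    """Filter a sequence, replicating the end values at the borders."""
    radius = len(kernel) // 2
    n = len(values)
    return [sum(kernel[k + radius] * values[min(n - 1, max(0, i + k))]
                for k in range(-radius, radius + 1))
            for i in range(n)]


def gaussian_smooth(image, sigma):
    """Separable Gaussian smoothing: rows first, then columns."""
    kernel = gaussian_kernel(sigma)
    rows = [smooth_1d(row, kernel) for row in image]
    cols = [smooth_1d(list(col), kernel) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def k_m(density):
    """Hypothetical K_M: below 1 under 0.25, above 1 beyond it."""
    return 0.5 + 2.0 * density


def k_nl(weighted):
    """Hypothetical K_NL: suppress very low weighted contrasts."""
    return 0.0 if abs(weighted) < 0.05 else 1.0


def dwce(normalized, sigma_d, sigma_c):
    """Apply Eqs. (1)-(3) to a normalized image with values in [0, 1]."""
    density = gaussian_smooth(normalized, sigma_d)             # F_D
    blurred = gaussian_smooth(normalized, sigma_c)
    contrast = [[n - b for n, b in zip(rn, rb)]                # Eq. (1)
                for rn, rb in zip(normalized, blurred)]
    weighted = [[k_m(d) * c for d, c in zip(rd, rc)]           # Eq. (2)
                for rd, rc in zip(density, contrast)]
    return [[k_nl(w) * w for w in row] for row in weighted]    # Eq. (3)
```

Swapping in other shapes for `k_m` and `k_nl` changes the degree of enhancement without altering the structure of the filter.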

The DWCE enhancement scheme is very general. It allows
for great flexibility in defining the density and contrast
images by the selection of filter parameters. It also allows for
significantly different degrees of enhancement based on the
selected multiplication and nonlinear rescaling functions. In
addition, we have found that the enhanced image is not very
sensitive to small variations in K_M(·) and K_NL(·). Slight
changes in the multiplication functions produced only slight
visible variations in the image enhancement and had little
effect on the detected edges. This was observed by Peli and
Lim as well [19]. Since we defined the multiplication functions
in the enhancement filter empirically, they may not be optimal.
Therefore, variations in the multiplication functions of Fig. 4
(or completely new functions) may provide better overall
performance.

Fig. 4. Plots of (a) the weighted contrast multiplication function,
K_M(F_D(x,y)), and (b) the nonlinear rescaling multiplication function,
K_NL(F_KC(x,y)), used in the DWCE filter implementation.

B. Edge Detection

The primary motivation for enhancement prefiltering is to
improve the detectability of mass edges. Edges are defined by
changes or discontinuities in the intensity across an image.
This feature is fundamentally important in image processing
because it provides an indication of the physical extent of
objects within the image. A common approach to monochrome
edge detection is to apply linear or nonlinear edge
enhancement followed by a threshold operation [20]. A simple
example of edge enhancement is discrete differencing, which
is analogous to continuous spatial differentiation. However,
with discrete differencing the edges are directionally dependent
upon the differencing operation used. This dependence can
be avoided by using a Laplacian mask, which sharpens edges
without regard to direction [20]. The performance of any
edge detector can be severely degraded when an image is
corrupted with noise. To alleviate this problem, statistical
edge detection methods have been developed for cases in which
the form of noise disturbance is known. Alternatively, edge
detection algorithms have to be combined with smoothing filters
to improve their performance.

In this study, object edges were detected from the DWCE
prefiltered images using an “optimal” Laplacian-Gaussian
(LG) edge detector. For a given image, $I(x,y)$, the LG edge
detector defines edges as simply the zero crossing locations of

$$\nabla^2 G(x,y) * I(x,y) \qquad (4)$$

where $G(x,y)$ is a two-dimensional Gaussian smoothing
function [21]. The degree of smoothing is controlled by a single
parameter, $\sigma_e$, the standard deviation of the smoothing
function. This edge detector is optimal in the sense that the output
energy near the edge features is maximized [22], [23]. It
also tends to produce closed regions, which makes detection
of isolated objects within the image easier. The LG edge
detector can be applied recursively from lower resolution
(large $\sigma_e$) to high resolution (smaller $\sigma_e$) using the
edge map of the previous stage as a guide for the current edge
detector. The multi-resolution approach will improve the
localization of the edges, which are otherwise degraded by the
Gaussian smoothing. In this study, we applied a single-stage LG
edge detector with $\sigma_e = 2$ because the DWCE filter alone
provided sufficient noise reduction.
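
The zero-crossing rule of Eq. (4) can be illustrated with a minimal sketch. It is not the implementation used in the study: the kernel radius of three standard deviations, the shift that forces the sampled kernel to sum to zero, the edge-replicating border handling, and all function names are assumptions made for illustration. Only the value $\sigma_e = 2$ comes from the text.

```python
import math

def log_kernel(sigma, radius=None):
    """Sampled Laplacian-of-Gaussian kernel, shifted so that it sums to zero.
    The 3-sigma radius and the zero-sum shift are illustrative assumptions."""
    if radius is None:
        radius = int(math.ceil(3 * sigma))
    kernel = []
    for y in range(-radius, radius + 1):
        row = []
        for x in range(-radius, radius + 1):
            r2 = x * x + y * y
            gauss = math.exp(-r2 / (2.0 * sigma ** 2))
            row.append((r2 - 2.0 * sigma ** 2) / sigma ** 4 * gauss)
        kernel.append(row)
    mean = sum(map(sum, kernel)) / float((2 * radius + 1) ** 2)
    return [[v - mean for v in row] for row in kernel]

def convolve(img, kernel):
    """Same-size filtering with edge replication (the kernel is symmetric,
    so correlation and convolution coincide)."""
    h, w = len(img), len(img[0])
    r = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(-r, r + 1):
                irow = img[min(max(y + dy, 0), h - 1)]
                krow = kernel[dy + r]
                for dx in range(-r, r + 1):
                    acc += krow[dx + r] * irow[min(max(x + dx, 0), w - 1)]
            out[y][x] = acc
    return out

def zero_crossings(resp, eps=1e-6):
    """Pixels whose response changes sign against the right or lower neighbour.
    eps suppresses sign flips caused only by floating-point noise."""
    h, w = len(resp), len(resp[0])
    edges = set()
    for y in range(h - 1):
        for x in range(w - 1):
            a = resp[y][x]
            for b in (resp[y][x + 1], resp[y + 1][x]):
                if a * b < 0 and abs(a - b) > eps:
                    edges.add((x, y))
    return edges
```

Applied to a synthetic bright disc, the detected zero crossings form a closed contour near the disc boundary, which is the property the text relies on when each enclosed object is subsequently filled.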

Once  the  object  edges  have  been  defined  in  the  image  using 
the  LG  edge  detection,  each  enclosed  object  is  filled.  This 
removes  any  holes  that  may  have  formed  inside  an  object. 
Fig.  3(d)  shows  the  enclosed  filled  edge  regions  produced  by 
extracting  the  LG  edges  from  the  DWCE  filtered  image  of 
Fig.  1(b).  Each  of  the  regions  produced  by  the  filling  is  defined 
by  its  edge  pixels,  thus  forming  a  set  of  detected  objects.  This 
set  of  objects  defines  all  the  detected  structures  within  the 
original  breast  image. 

C. Object Splitting

One problem with the DWCE filter is that different structures
within the breast can merge into a single connected
region. The result is multiple objects merged into a single
larger object. The morphological features of these large
detected objects do not necessarily correlate to features of the
smaller breast masses and normal tissue. In order to reduce
the distortion due to merging, binary splitting is performed
on the detected objects. The splitting algorithm searches
for narrowings in the cross section of objects (i.e., necks).

Fig. 5. (a) The block diagram for the splitting stage and (b) an example
object containing the search ranges for the global and local cross-section
ratios.

Fig. 6. (a) A set of detected objects and (b) the resulting set of objects
produced by the splitting algorithm.

Fig. 5(a) depicts the block diagram for the splitting algorithm.
The algorithm initially finds the cross section width for each
column in the object as shown in Fig. 5(b). This produces
$F_x(x)$, which is a vector of length $n$. In the next stage of
the splitting algorithm, the area ratio and the global and local
cross-section width ratios are calculated for each column of
the object. These values are defined as

$$F_{AR}(x) = \frac{\min(A_R(x), A_L(x))}{\max(A_R(x), A_L(x))} \qquad (5)$$

$$F_{Gbl}(x) = \left| 1.0 - \frac{F_x(x)}{\max\{F_x(z) : z \in [0, n-1]\}} \right| \qquad (6)$$

$$F_{Lcl}(x) = \left| 1.0 - \frac{F_x(x)}{\max\{F_x(z) : z \in [x-2, x+2]\}} \right| \qquad (7)$$



where $A_R(x)$ and $A_L(x)$ are the areas of the right and left
objects produced by splitting at location $x$, and the local and
global ranges are shown in Fig. 5(b). At each potential neck
location, $x_0$, a cut value is defined as a linear combination of
the cross-section width ratios and the area ratio of the two
regions formed by the split


$$F_{cut}(x_0) = W_G F_{Gbl}(x_0) + W_L F_{Lcl}(x_0) + W_A F_{AR}(x_0) \qquad (8)$$

For the present study $W_G$, $W_L$, and $W_A$ were chosen to be
1.5, 2.0, and 1.0, respectively. A maximum cut value for all
narrowings in the vertical, horizontal, 45° and 135° directions
is found and compared to a minimum cut threshold. If the
maximum cut value exceeds this threshold, the object is split
at that point; otherwise, it is left unchanged. Fig. 6(a) and (b)
show a typical set of detected objects in the image before and
after splitting, respectively. Round shaped objects were not cut
while objects with necks were split in appropriate locations
and directions. The advantage of this algorithm is that by
incorporating the area ratio into the cut location, preference is
given to neck locations near the center of the object. Note that
this splitting algorithm is applied to the binary object images.
The algorithm does not make use of information about the
likelihood of multiple objects to define split locations.

D. Morphological Object Classification

Since the DWCE prefilter enhances both breast masses
and normal tissue, a large number of detected objects are
usually found. In order to reduce the number of objects to
a manageable size, morphological features are extracted from
each object and used in a preliminary screening of breast
masses from normal tissue. The morphological features used
in this classification include the number of edge pixels, area,
shape and contrast of the objects. Two features, circularity and
rectangularity, are used to characterize the shape of an object.
To define these two features, the bounding box containing the
object, and a circle with area equivalent to the object area
and centered at its centroid location are first calculated. Fig. 7
shows the equivalent area circle, $F_{Eq}(x,y)$, with radius

$$r_{Eq} = \sqrt{\mathrm{area}(F_{obj}) / \pi} \qquad (9)$$

and the bounding box for the object, $F_{BB}(x,y)$. The
definitions of circularity and rectangularity are then given by


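$$\mathrm{Circularity} = \frac{\mathrm{area}(F_{obj} \cap F_{Eq})}{\mathrm{area}(F_{obj})} \qquad (10)$$

$$\mathrm{Rectangularity} = \frac{\mathrm{area}(F_{obj})}{\mathrm{area}(F_{BB})} \qquad (11)$$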


Using these five morphological features, classification was
performed on the detected objects.
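
The two shape features can be computed directly from a binary object. The following is a minimal sketch that assumes an object is given as a collection of (x, y) pixel coordinates and that a pixel belongs to the equivalent circle when its center lies within the circle radius; the function name and this pixel-membership rule are illustrative assumptions.

```python
import math

def shape_features(pixels):
    """Return (circularity, rectangularity) for an object given as an
    iterable of (x, y) pixel coordinates."""
    pixels = set(pixels)
    area = len(pixels)
    # Centroid of the object and radius of the circle with the same area.
    cx = sum(x for x, _ in pixels) / float(area)
    cy = sum(y for _, y in pixels) / float(area)
    r_eq = math.sqrt(area / math.pi)
    # Object pixels that also fall inside the equivalent-area circle.
    overlap = sum(1 for x, y in pixels
                  if (x - cx) ** 2 + (y - cy) ** 2 <= r_eq ** 2)
    # Axis-aligned bounding box measured in whole pixels.
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    bb_area = (max(xs) - min(xs) + 1) * (max(ys) - min(ys) + 1)
    return overlap / float(area), area / float(bb_area)
```

For a filled 10 x 10 square this gives a rectangularity of 1.0 and a circularity of 0.88, while a thin 20 x 1 line keeps a rectangularity of 1.0 but drops to a circularity of 0.3, showing why the two features are used together.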

The morphological classification is not meant to be a final
classification of the detected objects. Instead, it is used to
reduce the number of objects so that further detailed analysis
can be performed in each region. Once the number of objects
has been reduced, regions of interest (ROI's) will be extracted
based on the shape and location of the remaining objects. More
sophisticated classification algorithms will then be applied to
the extracted ROI's to differentiate between masses and normal
breast tissue. Therefore, the goal in this classification block
is to reduce the number of objects with minimal loss in the
number of masses.

Fig. 7. An example object with its equivalent circle, F_Eq, and bounding box,
F_BB, used in the definition of circularity and rectangularity, respectively.

Simple thresholding, linear discriminant analysis (LDA)
[24], and a back-propagation neural network (BPN) [25] have
been investigated as potential classification schemes. Simple
thresholding sets a maximum and a minimum value for each
morphological feature. If a detected object falls within the
bounds for each of the features, it is kept as a potential mass;
otherwise it is considered to be normal tissue. LDA forms an
optimal linear combination of the features which maximizes
the group mean separation of the mass and nonmass objects.
If the features follow a multivariate normal distribution with
an identical covariance matrix for both groups, then LDA will
yield the optimal classification. This linear combination forms
a single discriminant score for each detected object [24], [26].
BPN also forms a single discriminant score for each detected
object, but it finds the best nonlinear combination of features
that minimizes the cost associated with misclassification. In
our implementation, the BPN consists of an input layer, an
output layer, and one hidden layer. Each layer contains a
number of nodes interconnected to all nodes in the previous
and the subsequent layers by weights. A weighted sum of
node values from the previous layer stimulates a node in
the subsequent layer through a nonlinear sigmoidal activation
function. The neural network learns by supervised feedforward
back-propagation training of the interconnecting weights [25].
A threshold applied to the LDA or BPN discriminant score
provides a means for separating potential breast masses from
the normal tissues. Discriminant scores above the threshold are
classified as potential masses while scores falling below the
threshold are considered as normal tissue and thus discarded.

III. METHODS

A. Database

The clinical mammograms used in this study were randomly
selected from the files of patients who had undergone
biopsy in the Department of Radiology at the University of
Michigan. The mammograms were acquired using a Kodak
MinR/MRE screen/film system with extended cycle processing.
The mammography systems have a 0.3-mm focal spot,
a molybdenum anode, 0.03-mm-thick molybdenum filter and
a 5:1 reciprocating grid. All systems have been certified by
the American College of Radiology (ACR), and the image
quality is monitored according to the ACR's recommended
guidelines. Our selection criterion was simply that a
biopsy-proven mass could be seen on the mammogram. Our data set
in this preliminary study was composed of 25 mammograms.
The size of the masses ranged from 6 mm to 26 mm with
a mean size of 12.4 mm and included 11 malignant and 14
benign masses.

The mammograms were digitized with a LUMISYS DIS-1000
laser film scanner with a pixel size of 100 μm × 100 μm
and 4096 gray levels. The DIS-1000 logarithmically amplifies
the light transmitted through the mammographic film before
digitization so that the gray levels are linearly proportional to
optical densities in the range of 0.1 to 2.8 optical density units
(O.D.). The O.D. range of the scanner was 0 to 3.5 with large
pixel values in the digitized mammograms corresponding to
low O.D. The digitized images are approximately 2000 × 2000
pixels in size. Before the DWCE segmentation was applied,
the images were smoothed within an 8 × 8 pixel window
using local averaging and then subsampled by a factor of 8.
This resulted in images of approximately 256 × 256 pixels
for processing. Our data set was composed of 25 of the
subsampled mammograms and will be referred to as the
“subsampled” mammogram set in the following discussion.


B.  DWCE  Implementation 

Fig. 8 shows the block diagram for the DWCE implementation
used to detect breast masses in the 25 digitized
mammograms. It was performed using two DWCE stages.
In the first stage, the DWCE prefiltering, edge detection
and simple thresholding classification were applied to each
subsampled mammogram, and the potential mass objects were
identified. For each potential mass object, an ROI was
extracted from the corresponding subsampled mammogram using
the bounding box of the object to define the region. The
minimum size for the extracted ROI's was chosen to be
32 × 32 pixels. Any object with a bounding box smaller than
this size had its bounding box uniformly expanded in each
direction (horizontal and vertical) until it reached 32 × 32
pixels. This expanded bounding box was then used to define
the extracted ROI region.
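
The bounding-box expansion described above can be sketched as follows. The text specifies only the 32 × 32 minimum and the uniform expansion; the handling of boxes near the image border (shifting the box back inside the image) and the function name are assumptions added for illustration.

```python
def expand_box(box, image_size, min_size=32):
    """Grow a bounding box to at least min_size x min_size pixels.

    box        -- (x0, y0, x1, y1) in inclusive pixel coordinates
    image_size -- (width, height) of the subsampled mammogram
    """
    def grow(lo, hi, limit):
        pad = min_size - (hi - lo + 1)
        if pad <= 0:
            return lo, hi              # already large enough along this axis
        lo -= pad // 2                 # expand evenly on both sides
        hi += pad - pad // 2
        if lo < 0:                     # assumed border rule: shift inside
            hi -= lo
            lo = 0
        if hi > limit - 1:
            lo -= hi - (limit - 1)
            hi = limit - 1
        return max(lo, 0), hi

    x0, y0, x1, y1 = box
    nx0, nx1 = grow(x0, x1, image_size[0])
    ny0, ny1 = grow(y0, y1, image_size[1])
    return nx0, ny0, nx1, ny1
```

For example, a 10 × 20 pixel box at (100, 100)-(109, 119) in a 256 × 256 image becomes (89, 94)-(120, 125), which is exactly 32 × 32 pixels, while a box already larger than 32 pixels along an axis is left unchanged along that axis.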

Fig. 8. The block diagram of the complete two stage DWCE segmentation
method used for breast mass detection.

Fig. 9. (a) The detected objects obtained by the first stage and (b) second
stage of the DWCE segmentation.

Each of the extracted object ROI's was then passed through
a second DWCE stage. This stage included DWCE prefiltering,
edge detection, object reduction, splitting and classification.
The parameters used in the DWCE prefiltering, edge detection
and object reduction steps were identical to those of their first
stage counterparts. In the second stage classification, the
simple thresholding, LDA or BPN (five input nodes, three hidden
nodes and a single output node) classifiers were applied to the
detected objects. This allows the three classification schemes to
be compared. The detection accuracy was evaluated in terms of
the number of true positives (TP's) for a given number of false
positive (FP) detections. A TP was considered as an object
whose  area  overlaps  the  centroid  of  the  biopsy  proven  mass 
as  identified  by  a  radiologist.  Each  mammogram  in  our  data 
set  contained  a  single  TP  object.  All  other  objects  classified  as 
potential masses were considered as FP detections. Fig. 9(a)
and (b) show the detected objects from the mammogram
of Fig. 1(a) after the first and second stages, respectively, in
the  DWCE  segmentation.  A  simple  thresholding  classifier  was 
used  for  object  classification  after  both  stages  in  this  example. 
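
The scoring rule stated above (an object is a TP when its area overlaps the centroid of the biopsy-proven mass, and every other retained object is an FP) can be sketched as below. Representing each object as a set of pixel coordinates and rounding the centroid to the nearest pixel are illustrative assumptions, as is the function name.

```python
def score_detections(objects, mass_centroid):
    """Split detected objects into true and false positives.

    objects       -- list of sets of (x, y) pixels, one set per detected object
    mass_centroid -- (x, y) centroid of the radiologist-identified mass
    Returns (tp_indices, fp_indices).
    """
    cx = int(round(mass_centroid[0]))
    cy = int(round(mass_centroid[1]))
    tp, fp = [], []
    for index, pixels in enumerate(objects):
        # An object counts as a TP only if it covers the mass centroid.
        (tp if (cx, cy) in pixels else fp).append(index)
    return tp, fp
```

With one biopsy-proven mass per mammogram, as in this data set, the returned TP list holds at most one object per image when the detected objects do not overlap one another.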

IV.  Results 

The DWCE segmentation method was used to extract
potential mass objects from the 25 subsampled mammograms.
After the first stage of the DWCE segmentation, a total of 481
potential mass objects were detected including 24 of the 25
true mass objects. Local ROI's were then extracted using the
object bounding boxes and passed through a second DWCE
filtering stage. This second stage produced 218 detected
objects which increased to 461 total objects after the splitting
stage. The split object set again included 24 of the 25 true
mass objects.


Fig. 10. Plot of the tradeoff between TP and FP detections for thresholding,
LDA and BPN classification. [Horizontal axis: number of FPs per image.]

The  morphological  features  from  each  of  the  461  split 
objects  were  calculated.  The  sets  of  features  were  then  used  to 
classify  the  objects  as  potential  masses  using  the  thresholding, 
LDA, and BPN classification schemes. To perform this training
classification, the feature sets were input into the classifiers
with known desired output for each object and the classifiers
were trained to provide the best classification. Fig. 10
summarizes the trade-off between the TP fraction and the number
of FP detections for each of the three classification methods,
using both the mass and nonmass training features.
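
A trade-off curve of this kind can be traced by sweeping a threshold over the discriminant scores. The sketch below is a generic illustration, not the procedure used to produce Fig. 10: the function name is hypothetical, an object is kept when its score is greater than or equal to the threshold, and the TP fraction is taken relative to the number of mass objects supplied.

```python
def tradeoff_points(scores, is_mass, n_images):
    """Return (FPs per image, TP fraction) pairs, one per candidate threshold.

    scores   -- discriminant score of every detected object (LDA or BPN output)
    is_mass  -- parallel list of booleans, True for the true mass objects
    n_images -- number of mammograms the objects were drawn from
    """
    n_mass = sum(1 for m in is_mass if m)
    points = []
    for threshold in sorted(set(scores)):
        kept = [m for s, m in zip(scores, is_mass) if s >= threshold]
        tp = sum(1 for m in kept if m)
        fp = len(kept) - tp
        points.append((fp / float(n_images), tp / float(n_mass)))
    return points
```

Lowering the threshold moves along the curve toward a higher TP fraction at the cost of more FPs per image, which is the behaviour summarized for the three classifiers.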

V.  Discussion 

The  first  stage  of  DWCE  segmentation  is  applied  globally 
to  the  entire  breast  image.  Its  primary  function  is  to  define 
a  set  of  local  regions  likely  to  contain  the  true  mass  objects. 
In  the  present  study,  the  first  stage  provided  this  capability  by 
detecting  24  of  the  25  true  masses  in  our  data  set.  Splitting  was 
not applied to the detected objects in the first stage because
the  detected  regions  were  usually  much  smaller  than  the  true 
structures  seen  on  the  original  mammogram.  In  this  study,  the 
average  area  of  the  first  stage  objects  was  54.5  pixels.  These 
smaller  objects  produced  less  region  merging.  Therefore,  only 
a  simple  thresholding  classification  was  used  to  reduce  the 
initial  number  of  regions.  For  comparison  purposes,  LDA  and 
BPN  classification  were  also  applied  following  the  first  DWCE 
stage.  The  best  classifier  produced  over  10  FP’s  per  image  at 
a  90%  TP  detection  rate.  This  result  is  significantly  larger  than 
the  results  from  the  second-stage  classification  and  highlights 
the  need  for  a  second  filtering  step. 

The  second  DWCE  filter  is  the  main  detection  stage  of 
the  segmentation  scheme.  The  enhancement  is  applied  locally 
to each object ROI. This allows the filter to adapt to the
intensity  distribution  within  each  ROI,  thereby  reducing  the 
effects  of  intensity  variation  across  the  full  mammogram 
as  seen  by  the  first  filtering  stage.  The  improved  detection 
leads  to  larger  objects  having  an  average  area  of  70.2  pixels. 
The increased object size can be attributed to both more
precise edge information (i.e., better edge localization) and
region merging associated with the segmentation. Because
there was significant region merging in the second stage,
splitting was applied to all the potential mass objects. Object
splitting allows the merged regions to be separated without
losing any of the detected masses. However, it can introduce
false edge locations in the detected objects, lessening the
effectiveness of the classification stage. Alternatively, merging
can be reduced by applying a more stringent DWCE filter.
Note that stringent DWCE filters can be created by increasing
the intensity thresholds of $K_M(\cdot)$ (minimum value where
$K_M(\cdot) > 1.0$) and $K_{NL}(\cdot)$ (minimum value where
$K_{NL}(\cdot) = 1.0$). However, it was found that there was a
tradeoff between region merging and the number of missed masses.
The current combination of DWCE filter parameters and region
splitting was found to provide the best mass detection capability
and produced a minimum number of regions compared with other
combinations evaluated. The flexible form of the DWCE filter
leaves open the possibility that further optimization of the
detection can be explored with a completely different set of
filter parameters. Evaluation of different DWCE filters with
different functions for $K_M(\cdot)$ and $K_{NL}(\cdot)$ will be pursued
in future studies. In this preliminary study, our goal is to
demonstrate  the  feasibility  of  using  DWCE  filters  for  the 
detection  of  breast  masses. 
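
The neck search performed by the splitting stage (Eqs. (5)-(8)) can be sketched for a single direction. This is a minimal illustration under stated assumptions, not the authors' code: only the column direction is scanned, a split at column x is assumed to separate columns [0, x) from [x, n), the area ratio is assumed to be the smaller area divided by the larger, the assignment of the weights 1.5, 2.0 and 1.0 to the global, local and area terms follows one reading of the text, and the minimum cut threshold is a hypothetical parameter because its value is not given.

```python
def column_widths(mask):
    """Cross-section width of a binary object mask for each column."""
    return [sum(row[x] for row in mask) for x in range(len(mask[0]))]

def cut_values(widths, w_gbl=1.5, w_lcl=2.0, w_ar=1.0, local=2):
    """Return (cut value, column) pairs for every candidate split column.
    Assumes every column of the cropped object has at least one pixel."""
    n = len(widths)
    total = sum(widths)
    global_max = max(widths)
    values = []
    for x in range(1, n):
        left = sum(widths[:x])            # area of columns [0, x)
        right = total - left              # area of columns [x, n)
        f_ar = min(left, right) / float(max(left, right))
        f_gbl = abs(1.0 - widths[x] / float(global_max))
        lo, hi = max(0, x - local), min(n - 1, x + local)
        f_lcl = abs(1.0 - widths[x] / float(max(widths[lo:hi + 1])))
        values.append((w_gbl * f_gbl + w_lcl * f_lcl + w_ar * f_ar, x))
    return values

def best_split(widths, min_cut):
    """Column at which to split, or None if no cut value exceeds min_cut."""
    value, x = max(cut_values(widths))
    return x if value > min_cut else None
```

For a dumbbell-shaped object (two 5 × 5 blobs joined by a one-pixel-high neck three columns long) the largest cut value falls inside the neck, whereas a compact blob with widths [3, 5, 5, 5, 3] stays below a threshold of 3.0 and is left unsplit. The area-ratio term is what favours necks near the object center.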

The  final  classification  stage  was  used  to  further  reduce  the 
number  of  FP  regions  detected.  The  training  results  for  the 
thresholding, LDA, and BPN classifiers were compared using
the 461 objects produced by the second DWCE stage. The
number of FP's was initially reduced from 18.4 regions per
image to only 4.5 regions without increasing the number of
missed masses. For a 90% TP detection rate, the FP's can be
reduced to 3.0 per image. Fig. 10 shows that the thresholding
method  and  the  BPN  classifier  provided  comparable  results 
with  the  BPN  slightly  better  when  a  few  additional  misses 
can  be  tolerated.  On  the  other  hand,  the  LDA  classifier 
consistently  produced  a  slightly  larger  FP  rate  for  a  given  TP 
rate  compared  with  both  the  thresholding  and  BPN  classifiers. 
However,  the  only  significant  difference  between  LDA  and 
the other two classifiers is seen at the 96% TP detection
rate.  Again,  the  training  results  in  Fig.  10  are  based  on  only 
five  morphological  features.  Additional  features  may  improve 
individual  classifier  performance.  Further  reduction  in  the 
FP's  may  also  be  obtained  using  more  sophisticated  tissue 
classification  algorithms  either  independently  or  in  conjunction 
with  the  morphological  classifiers  [26],  [27]. 

A TP detection rate of 96% with only 4.5 FP per image
shows that the DWCE can be used to detect breast masses on
digitized mammograms. Its main advantage is that it adapts the
enhancement to the local density or background in the image.
This enables subtle as well as obvious masses superimposed
on structured background to be detected. Since the DWCE
provides high frequency edge information, morphological features
based  on  object  boundaries  can  be  used  in  combination  with  a 
classification  scheme  to  reduce  the  number  of  detected  regions. 
The  edge  information,  however,  is  not  complete  because 
of  region  merging  and  the  subsequent  splitting  operation 
which  introduces  further  errors  in  edge  localization.  The  edge 
locations  are  also  affected  by  the  DWCE  filter  parameters  as 


seen  in  Fig.  9.  The  current  DWCE  implementation  produces 
conservative  estimates  for  the  true  edges  of  the  objects.  In 
other  words,  the  estimated  edges  fall  within  the  true  boundary 
for  isolated  breast  structures. 

VI. Conclusion

The results of the DWCE segmentation indicate that it is a
viable option for automated mass detection in mammography.
It effectively segmented the digitized mammograms into a
small number of potential breast masses without a significant
loss in the number of true masses. In this preliminary study, the
thresholding,  LDA  and  BPN  morphological  feature  classifiers 
were  evaluated  using  a  limited  training  data  set.  The  initial 
results  indicate  that  nonlinear  combinations  of  the  features 
are  slightly  more  effective  for  FP  reduction.  Further  studies 
are  currently  under  way  using  a  larger  set  of  images.  This 
larger  image  set  will  be  used  to  determine  if  additional 
morphological  features  are  necessary  and  to  determine  if  the 
BPN  classifier  is  truly  the  best  choice.  The  utility  of  the 
currently  trained  DWCE  segmentation  method  (i.e.,  structure, 
filter  and  classification  parameters)  will  also  be  evaluated  using 
a  unique  subset  of  test  images  from  the  new  database.  A 
study involving a set of extracted ROI's from the original
high resolution mammograms based on the detected DWCE
objects is also being conducted. Its purpose is to investigate if
more sophisticated feature extraction and tissue classification
schemes can further reduce the number of FP detections and
determine the malignancy of each detected mass.

Our previous receiver operating characteristic (ROC) study indicated that the detection accuracy of
microcalcifications by radiologists is significantly reduced if mammograms are digitized at 0.1
mm × 0.1 mm. Our recent study also showed that detection accuracy by computer decreases as the
pixel size increases from 0.035 mm × 0.035 mm. It is evident that very large matrix sizes have to be
used for digitizing mammograms in order to preserve the information in the image. Efficient
compression techniques will be needed to facilitate communication and archiving of digital
mammograms. In this study, we evaluated two compression techniques: full frame discrete cosine
transform (DCT) with entropy coding and Laplacian pyramid hierarchical coding (LPHC). The
dependence of their efficiency on the compression parameters was investigated. The techniques
were compared in terms of the trade-off between the bit rate and the detection accuracy of subtle
microcalcifications by an automated detection algorithm. The mean-square errors in the
reconstructed images were determined and the visual quality of the error images was examined. It was
found that with the LPHC method, the highest compression ratio achieved without a significant
degradation in the detectability was 3.6:1. The full frame DCT method with entropy coding
provided a higher compression efficiency of 9.6:1 at comparable detection accuracy. The mean-square
errors did not correlate with the detection accuracy of the microcalcifications. This study
demonstrated the importance of determining the quality of the decompressed images by the specific
requirements of the task for which the decompressed images are to be used. Further investigation is
needed for selection of optimal compression technique for digital mammograms. © 1996 American
Association of Physicists in Medicine.

Key words: mammography, digital, microcalcifications, image compression, computer-aided
diagnosis


I.  INTRODUCTION 

X-ray mammography is the most effective method in detection
of early breast cancers. Because of the stringent
requirements for imaging of subtle lesions, the image recording
systems for mammography have to provide very high
spatial resolution and high contrast sensitivity. At present,
screen-film systems specially designed for mammography
are the only recording medium that can provide the image
quality needed. However, because of the advancement in
digital imaging technology, digital mammography is becoming
a realistic goal. Digital mammography offers the advantages
of electronic transmission, consultation, and archiving
as well as image enhancement and computer-aided
diagnosis. These will potentially make mammography more
widely accessible, reduce the cost, and improve the diagnostic
accuracy of mammography.

Our previous receiver operating characteristic (ROC)
study indicated that the detection accuracy of microcalcifications
by radiologists is significantly reduced if mammograms
are digitized at 0.1 mm × 0.1 mm. Our recent study also
showed that detection accuracy by computer decreases as the
pixel size increases from 0.035 mm × 0.035 mm. It is evident
that very high resolution digitization has to be used for
mammograms in order to preserve the information in the
image. An 18 cm × 24 cm mammogram digitized at 0.05 mm
× 0.05 mm results in a matrix size of about 4000 × 5000. A
four-view study thus will provide 160 megabytes of digital
data. The transmission and archiving of such a large amount
of data is therefore one of the important considerations in
implementation of digital mammography. An efficient data
compression scheme that can reduce the amount of data
without degradation of the image quality for human and
machine interpretation will alleviate these problems.
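
The data-volume figure quoted above can be checked with a short calculation. The sketch assumes that each pixel is stored in two bytes, which is not stated in the text but is consistent with 12-bit gray levels and reproduces the 160-megabyte figure; the function names are illustrative.

```python
def matrix_size(width_cm, height_cm, pixel_um):
    """Number of pixel columns and rows for a film digitized at pixel_um."""
    return (width_cm * 10000) // pixel_um, (height_cm * 10000) // pixel_um

def study_megabytes(cols, rows, views=4, bytes_per_pixel=2):
    """Uncompressed size of a multi-view study in megabytes (10**6 bytes).
    bytes_per_pixel=2 is an assumption, not a value given in the text."""
    return cols * rows * views * bytes_per_pixel / 1e6

exact = matrix_size(18, 24, 50)            # 18 cm x 24 cm film at 0.05 mm
rounded_total = study_megabytes(4000, 5000)
exact_total = study_megabytes(*exact)
```

The exact matrix is 3600 × 4800 pixels, which the text rounds to about 4000 × 5000. The rounded matrix gives 160 megabytes for four views, and the exact matrix gives about 138 megabytes.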

Much effort has been devoted to evaluate compression
methods for radiological images. Most studies so far applied
to digital chest radiography because it is the most commonly
performed procedure in medical imaging, and some
direct digital imaging systems for chest radiography are
already available. Recently several investigators have extended
their studies to general radiological images, including
computed tomography (CT), and magnetic resonance (MR)
images based on DCT and wavelet-type decomposition
methods. Some preliminary studies have also been
performed for digitized mammographic images.

In this study, we explored some of the issues involved in
compression of mammographic images for applications in
computerized detection of microcalcifications. Because primary
digital mammography systems are not yet available,
digital mammograms in this study were obtained by digitization
of screen-film mammograms. We selected two image
compression techniques, the Laplacian pyramid hierarchical
coding (LPHC) and the discrete cosine transform (DCT)
with full frame entropy coding (FFEC), for processing of
mammograms. The LPHC technique was chosen because of
its similarity to the difference-image technique that we used
for enhancement of microcalcifications in the automated
detection algorithm. The DCT technique was commonly used
for irreversible image compression and FFEC was developed
to eliminate the block artifacts and improve compression
efficiency. The DCT-FFEC technique was further
implemented using bit plane splitting to improve the preservation
of detailed information. We compared the compression
efficiency of these techniques for digitized mammograms.
The fidelity of the information in the reconstructed
image was evaluated by the detectability of the microcalcifications
by an automated computer program. The results were
compared with the mean square error (MSE), which was a
commonly used indicator of information loss in image
compression.


II. MATERIALS AND METHODS

A.  Data  set  of  digital  mammograms 

Twenty-five mammograms were selected from patient
files from the Department of Radiology at the University of
Michigan. All mammograms were acquired with American
College of Radiology accredited machines and recorded with
Kodak Min R/Min R-E screen-film systems. Each mammogram
contained a cluster of subtle microcalcifications, the
presence of which had been verified by biopsy. The
mammograms were digitized with a high-resolution laser scanner
at a pixel size of 35 μm × 35 μm and 12-bit gray levels. The
digitizer logarithmically amplified the transmitted light
through the film before digitization. The scanner was
calibrated such that the gray levels were linearly proportional to
optical density (OD) in the range of about 0.1-2.8 OD. The
optical density range of the scanner was 0-3.5 OD.

Because  of  the  computational  requirement  for  processing 
the  entire  breast  image  that  could  be  greater  than  4000X5000 
pixels,  we  manually  extracted  an  ROI  of  1024X1024  pixels, 
which  contained  the  cluster  of  microcalcifications  from  each 
digitized  image.  The  ROIs  were  used  as  input  images  in  the 
compression  and  detection  studies.  To  establish  a  “truth” 
file  for  the  microcalcifications,  the  coordinate  of  each  indi¬ 
vidual  microcalcification  in  an  ROI  was  identified  manually 
with  a  cursor  on  a  display  workstation.  The  locations  of  the 
microcalcifications were verified by visual comparison with
those on the film mammograms using a magnifier. The
coordinates were stored in a “truth” file and used for scoring
the detection accuracy by the automated procedure, as
discussed below. The total number of microcalcifications in the
25 ROIs was 293.

B. Laplacian pyramid hierarchical coding (LPHC)

The LPHC is a noncausal image coding method that
decomposes an image into a low-pass image and a sequence of
sub-band images, each of which is reduced in spatial
resolution by a factor of 2, thereby forming a pyramidal
hierarchical structure. The LPHC technique implemented for this
study is described in Appendix A. This compression method
was developed for progressive image transmission. The
low-resolution version of the image (the top level in the
Gaussian pyramid) is transmitted first to provide an early
impression of the image content; progressively higher
resolution images are subsequently transmitted to provide
greater detail. Transmission can be terminated as soon as
sufficient image information is received. If the low level
images are not needed and therefore not transmitted, the
number of bits per image in the transmission is greatly
reduced. Furthermore, because the top level Gaussian pyramid
image is small and the entropy of the Laplacian pyramid
images is low owing to the removal of the pixel-to-pixel
correlation, the coding of the decomposed images can be more
efficient than that of the original image. Further image
compression can be achieved by reducing the quantization
levels of the pixel values of the Laplacian pyramid images.
In this study, we investigated the effects of the image
reconstruction levels and the quantization levels of the
Laplacian pyramid images on detection accuracy by the
computer program.
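The decomposition and reconstruction described above can be sketched as follows. This is a minimal illustration, not the implementation used in the study: the 5 × 5 generating kernel is assumed to be the separable kernel with weights (0.05, 0.25, 0.4, 0.25, 0.05), whereas the actual kernel w(m,n) is defined in Appendix A, and the function names and the reflected border handling are hypothetical choices.

```python
import numpy as np

# Assumed 1-D generating weights; the paper's w(m,n) is given in its
# Appendix A and may differ from this choice.
W1D = np.array([0.05, 0.25, 0.4, 0.25, 0.05])


def blur(img):
    """Separable 5x5 low-pass filter with reflected borders."""
    padded = np.pad(img, 2, mode="reflect")
    # Filter along the columns of each row, then along the rows.
    rows = sum(W1D[k] * padded[:, k:k + img.shape[1]] for k in range(5))
    return sum(W1D[k] * rows[k:k + img.shape[0], :] for k in range(5))


def reduce_level(img):
    """Low-pass filter, then subsample by 2 in each direction."""
    return blur(img)[::2, ::2]


def expand_level(img, shape):
    """Upsample by 2 with zero insertion, then interpolate."""
    up = np.zeros(shape)
    up[::2, ::2] = img
    return 4.0 * blur(up)


def build_pyramid(image, levels):
    """Return the Laplacian images L_0..L_{N-1} and the top Gaussian image."""
    gaussian = image.astype(float)
    laplacians = []
    for _ in range(levels):
        smaller = reduce_level(gaussian)
        laplacians.append(gaussian - expand_level(smaller, gaussian.shape))
        gaussian = smaller
    return laplacians, gaussian


def reconstruct(laplacians, top):
    """Invert build_pyramid by expanding and adding level by level."""
    image = top
    for lap in reversed(laplacians):
        image = expand_level(image, lap.shape) + lap
    return image
```

Without requantization this decomposition is reversible up to floating-point error, because each Laplacian image stores exactly the residual that the expansion step fails to predict. Loss enters only when the Laplacian images are requantized or when reconstruction stops at a reduced level.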

C. Discrete cosine transform-full frame entropy
coding  (DCT-FFEC) 

Block-DCT techniques are commonly used for compression
of continuous-tone digital images. DCT can effectively
localize most of the image information (energy) in a small
area in the spatial frequency domain. However, the division
of the image into small blocks for DCT often introduces
blocky artifacts to the reconstructed images when high
compression ratios are desired.

The full frame DCT technique transforms the entire image
in one block. It not only eliminates the blocky artifacts, but
also provides the advantage that the large-size DCT can
localize the image information in a relatively smaller
bandwidth than the small-size DCT. The coefficients in the full
frame DCT matrix can be quantized with linear or nonlinear
methods and then encoded by various coding techniques. For
example, with a full frame bit allocation (FFBA) technique,
a bit-allocation table based on the characteristics of the
transformed image and the desired compression ratio is produced.
The table indicates the number of bits designated for a
specific coefficient or groups of coefficients. The quantized
coefficients are then packed into the bit space indicated in the
table.
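The energy-localization property of the full frame DCT can be illustrated with a short sketch. It builds an orthonormal DCT-II matrix directly with NumPy and applies it to the whole image as a single block; the function names are hypothetical, and no quantization, bit allocation, or entropy coding from the FFBA or FFEC schemes is attempted.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n)[:, None]   # frequency index
    x = np.arange(n)[None, :]   # sample index
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * x + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)  # DC row has a different normalization
    return c


def full_frame_dct(image):
    """2-D DCT of the entire image treated as one block."""
    rows, cols = image.shape
    return dct_matrix(rows) @ image @ dct_matrix(cols).T


def inverse_full_frame_dct(coeffs):
    """Inverse of full_frame_dct (the matrices are orthonormal)."""
    rows, cols = coeffs.shape
    return dct_matrix(rows).T @ coeffs @ dct_matrix(cols)


def low_frequency_energy_fraction(coeffs, size):
    """Fraction of total energy in the size x size low-frequency corner."""
    return float((coeffs[:size, :size] ** 2).sum() / (coeffs ** 2).sum())
```

For a smooth image, almost all of the energy falls in a small low-frequency corner of the coefficient matrix, which is what allows the remaining coefficients to be coded with few bits. Small high-frequency details such as microcalcifications sit outside that corner, which is why their detectability has to be checked separately.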

More recently, an entropy coding scheme that does not
require a bit allocation table was developed for full frame
DCT. For chest radiographs and CT images, the FFEC
method was found to be more efficient than FFBA, in that it
could produce a lower MSE at a given compression ratio or a
higher compression efficiency at the same MSE. These studies
also indicated that a bit splitting-remapping method was
useful in preventing errors in encoding the most significant
bits and in lessening edge artifacts caused by compression
and decompression. We therefore investigated the FFEC
technique with and without bit splitting-remapping for image
compression in computer-aided diagnosis (CAD) applications.
A description of the FFEC technique implemented for this
study is given in Appendix B.

D.  Mean-square  error  (MSE) 

A commonly used indicator of information loss in a
compression-decompression scheme is the MSE between
the original image, $g(x,y)$, and the decompressed image,
$g_r(x,y)$, as given by

$$\mathrm{MSE} = \frac{1}{N_p} \sum_{x} \sum_{y} \left[ g(x,y) - g_r(x,y) \right]^2, \qquad (1)$$

where $N_p$ is the total number of pixels in the image. The
MSE is a global measurement of distortion of the image by a
lossy compression technique. The use of MSE as an indicator
of information loss in our computerized detection of
microcalcifications on compressed-decompressed mammograms
was evaluated in this study.
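As a concrete reading of Eq. (1), the MSE can be computed as sketched below. The function name is hypothetical; the only assumption is that both images are arrays of the same shape.

```python
import numpy as np


def mean_square_error(original, decompressed):
    """Mean of squared pixel differences over all pixels, as in Eq. (1)."""
    g = np.asarray(original, dtype=np.float64)
    g_r = np.asarray(decompressed, dtype=np.float64)
    if g.shape != g_r.shape:
        raise ValueError("images must have the same shape")
    return float(np.mean((g - g_r) ** 2))


# Small worked example: differences are 0, -2, 0, 4, so the MSE is 20 / 4.
example = mean_square_error([[0, 4], [8, 12]], [[0, 6], [8, 8]])
```

Because every pixel contributes equally, a few badly distorted pixels in a small cluster change the MSE very little when the image has about a million pixels.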

E.  Computerized  detection  of  microcalcifications 

We have described our CAD algorithm for detection of
microcalcifications in detail previously. Briefly, there
are three major steps in the algorithm: preprocessing,
segmentation, and classification. In the preprocessing step, the
input digital mammogram is processed with a
signal-enhancement filter and a signal-suppression filter. The
difference of these two filtered images results in an image in
which the structured background is suppressed and the
signal-to-noise ratio (SNR) of the microcalcifications is
enhanced. This is also referred to as a difference-image
technique. In the segmentation step, the program determines the
gray level histogram of the processed image within the breast
region. A gray level thresholding technique is used to locate
potential signal sites above a global threshold. The threshold
is changed iteratively until the number of sites obtained falls
within the chosen input maximum and minimum numbers.
At each potential site, a locally adaptive gray level
thresholding technique in combination with region growing is
performed to segment the connected pixels above a local
threshold, which is calculated as the product of the local
root-mean-square (rms) noise and an input SNR threshold. The
characteristics of a segmented signal, such as its size,
contrast, SNR, and location, are determined.

In the classification step, the computer program performs
three tests to distinguish signals from noise or artifacts. A
lower bound is imposed on the size to exclude signals below
a certain size, which are likely to be noise, and an upper
bound is set to exclude signals greater than a certain size,
which are likely to be large benign calcifications. A contrast
upper bound is also set to exclude potential signals that have
a contrast higher than an input number of standard deviations
above the average contrast of all potential signals found with
local thresholding. This criterion excludes the very
high-contrast signals that are likely to be artifacts and large
benign calcifications. A regional clustering procedure is then
applied to the remaining signals; a signal is kept if the
number of signals found within a neighborhood of a chosen
input diameter around that signal is greater than an input
minimum number. The remaining signals that are not found to
be in the neighborhood of any potential clusters are
considered isolated noise points or isolated calcifications and
are excluded. This clustering criterion is useful for reducing
false positives because true microcalcifications of clinical
interest always appear in clusters on mammograms. The
specific parameters used in each step have been described
previously.
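The regional clustering criterion can be sketched as follows. This is an illustration under stated assumptions rather than the program used in the study: the neighborhood is taken to be a circle of the given diameter centered on the signal, the signal itself is not counted among its neighbors, and the function and parameter names are hypothetical.

```python
import math


def regional_clustering(signals, diameter, min_count):
    """Keep signals that have more than min_count neighbors nearby.

    signals   -- list of (x, y) positions in any consistent length unit
    diameter  -- diameter of the circular neighborhood (assumed shape)
    min_count -- a signal is kept only if its neighbor count exceeds this
    """
    radius = diameter / 2.0
    kept = []
    for i, (xi, yi) in enumerate(signals):
        # Count the other signals inside the neighborhood of signal i.
        neighbors = sum(
            1
            for j, (xj, yj) in enumerate(signals)
            if j != i and math.hypot(xi - xj, yi - yj) <= radius
        )
        if neighbors > min_count:
            kept.append((xi, yi))
    return kept
```

With four signals packed into a small square and one far away, a neighborhood that spans the square keeps the four clustered signals and discards the isolated one, which is the behavior the text relies on to reduce false positives.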

In this study, a signal-enhancement filter with a 2 × 2 kernel
of constant weights and a signal-suppression filter, which was
a box-rim filter with a 20 × 20 kernel of constant weights
around the rim and a 12 × 12 central area of zero weights, were
used for preprocessing of the decompressed images obtained
with the DCT-FFEC techniques. The sum of the weights was
normalized to unity in each of the filters. For the images
compressed with the LPHC technique, we made use of the
Laplacian pyramid images to directly generate the difference
image. The decoded Laplacian pyramid images at all levels
were expanded to the original image size, summed together,
and convolved with the 5 × 5 kernel, $w(m,n)$, defined in
Appendix A. The resulting bandpass image was used as the
difference image. This was equivalent to using the
decompressed image as the signal-enhanced image and the
Nth-level Gaussian pyramid image expanded N times to the
original image size as the signal-suppressed image, and
convolving the difference between the two images with the
5 × 5 kernel.
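The two preprocessing filters used for the DCT-FFEC images can be sketched as below. The kernel sizes come from the text; the function names, the edge-replicated border handling, and the alignment of the even-sized kernels (which have no exact center pixel) are assumptions made for illustration.

```python
import numpy as np


def enhancement_kernel(size=2):
    """Constant-weight kernel normalized to unit sum."""
    k = np.ones((size, size))
    return k / k.sum()


def box_rim_kernel(outer=20, inner=12):
    """Constant weights on the rim, zeros in the central inner x inner area."""
    k = np.ones((outer, outer))
    margin = (outer - inner) // 2
    k[margin:outer - margin, margin:outer - margin] = 0.0
    return k / k.sum()


def filter_image(image, kernel):
    """Correlate image with kernel; output has the same size as the input."""
    kh, kw = kernel.shape
    top, left = kh // 2, kw // 2
    padded = np.pad(image.astype(float),
                    ((top, kh - 1 - top), (left, kw - 1 - left)),
                    mode="edge")
    out = np.zeros(image.shape, dtype=float)
    for dy in range(kh):
        for dx in range(kw):
            if kernel[dy, dx] != 0.0:
                out += kernel[dy, dx] * padded[dy:dy + image.shape[0],
                                               dx:dx + image.shape[1]]
    return out


def difference_image(image):
    """Signal-enhanced image minus signal-suppressed image."""
    enhanced = filter_image(image, enhancement_kernel())
    suppressed = filter_image(image, box_rim_kernel())
    return enhanced - suppressed
```

Because both kernels sum to unity, a uniform background cancels in the difference image, while a small bright object that fits inside the zero-weight central area of the box-rim filter survives the subtraction.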

F.  Analysis  of  detection  accuracy 

After passing the size, contrast, and regional clustering
criteria, the detected individual microcalcifications
were compared with the “truth” file of the input image.
The numbers of true-positive (TP) and false-positive (FP)
microcalcifications were scored. A detected signal was
scored as a TP microcalcification if it was within 0.35 mm
of a true microcalcification in the “truth” file. Once a true
microcalcification was matched to a detected
microcalcification, it was eliminated from further matching.
Any detected microcalcifications that did not match a true
microcalcification were scored as FPs. The trade-off between
the TP and FP detection rates by the computer program was
evaluated by the free-response receiver operating
characteristic (FROC) analysis by varying the input SNR threshold.
A low SNR threshold corresponded to a lax criterion with a
large number of FPs. A high SNR threshold corresponded to
a stringent criterion with a small number of FPs and a loss in
TPs. In this study, the FP rate was expressed as the number
of FPs per unit area of the ROI image in order to reduce its
dependence on the image size. The information content of
the reconstructed images was then evaluated by comparison
of the FROC curves, which indicated the detection accuracy
of the computer program.
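The scoring rule can be sketched as follows. The 0.35 mm criterion, the one-to-one matching, and the per-area FP rate come from the text; matching each detection to its nearest unmatched truth location, processing detections in list order, and the function names are assumptions. Coordinates are taken to be in millimeters (at a 35 μm pixel, 0.35 mm is 10 pixels).

```python
import math


def score_detections(detections, truth, max_distance_mm=0.35):
    """Return (TP, FP) counts for one image.

    Each truth location can be matched at most once; a detection with no
    unmatched truth location within max_distance_mm counts as an FP.
    """
    unmatched = list(truth)
    true_positives = 0
    false_positives = 0
    for det_x, det_y in detections:
        match = None
        best = max_distance_mm
        for idx, (tx, ty) in enumerate(unmatched):
            distance = math.hypot(det_x - tx, det_y - ty)
            if distance <= best:
                best = distance
                match = idx
        if match is None:
            false_positives += 1
        else:
            true_positives += 1
            unmatched.pop(match)  # remove from further matching
    return true_positives, false_positives


def roi_area_cm2(roi_pixels=1024, pixel_mm=0.035):
    """Area of a square ROI in cm^2."""
    side_cm = roi_pixels * pixel_mm / 10.0
    return side_cm * side_cm


def fp_per_cm2(false_positives, roi_pixels=1024, pixel_mm=0.035):
    """FP count normalized by the ROI area."""
    return false_positives / roi_area_cm2(roi_pixels, pixel_mm)
```

For the 1024 × 1024 ROIs at a 35 μm pixel size, the ROI side is about 3.58 cm, so one FP per cm² corresponds to roughly 13 FPs per ROI. An FROC curve is traced by repeating this scoring at each SNR threshold and plotting the TP rate against the FP rate.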
Fig. 1. An original ROI image with a cluster of microcalcifications used in
the  dataset  of  this  study.  A  subtle  cluster  of  about  ten  microcalcifications  is 
located  near  the  center  of  the  ROI. 


III.  RESULTS 

Figure 1 is an example of a 1024 × 1024-pixel ROI from a
mammogram digitized to 12 bits. A subtle cluster of about
ten microcalcifications is located near the center of the ROI.
Three levels of the Laplacian images of the ROI at 12 bits, as
well as the corresponding Laplacian images quantized to
eight bits, are shown in Fig. 2(a). It can be seen that the
frequency range of the information in the Laplacian images
decreased with increasing levels on the pyramid. The gray
level histograms of the two level-0 Laplacian images were
plotted in Fig. 2(b), which illustrates the low entropy
(defined in Appendix A) in a Laplacian pyramid image and the
further reduction of entropy by requantization. To
demonstrate visually the effect of requantization on image
fidelity, we reconstructed the ROI from the three levels of
eight-bit Laplacian pyramid images and the top level of the
Gaussian pyramid, as shown in the flow diagram of LPHC in
Appendix A. The error image between the original 12-bit image
in Fig. 1 and the reconstructed image is shown in Fig. 3(a).
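The reduction of entropy by requantization can be illustrated with a small sketch. It assumes that the entropy in question is the usual first-order (histogram) entropy in bits per pixel; the definition actually used in the study is given in Appendix A, and the function name and the sample values are hypothetical.

```python
import math
from collections import Counter


def first_order_entropy(pixels):
    """Histogram entropy in bits per pixel: -sum(p * log2(p))."""
    counts = Counter(pixels)
    total = len(pixels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical Laplacian pixel values (signed, centered on zero).
original = [-3, -2, -1, 0, 1, 2, 3, 4]

# Requantize by discarding the two least significant bits.
coarse = [(value >> 2) << 2 for value in original]

entropy_before = first_order_entropy(original)
entropy_after = first_order_entropy(coarse)
```

Discarding low-order bits merges neighboring gray levels into the same bin, so the histogram becomes more peaked and the entropy, which bounds the achievable bit rate of an ideal entropy coder, goes down.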

For the LPHC method, we attempted three different
ways to eliminate the least significant bits (LSBs) in the
Laplacian images. In the first method, each pixel value was
divided by $2^l$ ($l$ is the number of LSBs to be eliminated)
and rounded off to the nearest integer. The pixel value was
multiplied by $2^l$ during reconstruction. The second method
was similar to the first, except that the quotient was
truncated to an integer. The third method simply set the LSBs
to 0's by a bitwise AND operation with a bit plane mask and
was the most efficient in terms of computational speed among
the three. The third method yielded an image different from
the second method because the Laplacian image was a
difference image that contained negative integers. The
truncation method reduced the absolute values of both the
positive and negative integers, whereas the bit masking
method shifted both the positive and negative integers to
lower values (i.e., more negative for negative integers). The
error images between the original and the reconstructed
images with eight-bit quantization using the three different
bit reduction methods are shown in Figs. 3(a)-3(c),
respectively. The round-off and the bit masking methods for
bit reduction resulted in similar error images, but the error
image from the truncation method had obvious noise patterns.
As described below, the MSE of the truncation method was
much larger than those of the other two methods. The FROC
curves for the LPHC techniques shown in Figs. 4 and 6 were
obtained with the bit masking method because of its
computational speed and its similarities to the round-off
method.
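The difference between the three bit reduction methods on signed Laplacian values can be seen in a short sketch. Each function returns the value after reduction and reconstruction, so the results are directly comparable; the function names are hypothetical, and rounding exact halves upward is an assumption, since the text does not say how ties were handled.

```python
import math


def round_off(value, l):
    """Divide by 2**l, round to nearest (halves up), multiply back."""
    return math.floor(value / 2 ** l + 0.5) * 2 ** l


def truncate(value, l):
    """Divide by 2**l, truncate toward zero, multiply back."""
    return math.trunc(value / 2 ** l) * 2 ** l


def bit_mask(value, l):
    """Clear the l least significant bits with a bitwise AND.

    Python integers behave like two's complement values under '&', so
    negative inputs move toward more negative values.
    """
    return value & ~((1 << l) - 1)


# Eliminating four LSBs (12-bit to eight-bit) for a few sample values.
samples = [-21, -9, -5, 5, 9, 21]
rounded = [round_off(v, 4) for v in samples]
truncated = [truncate(v, 4) for v in samples]
masked = [bit_mask(v, 4) for v in samples]
```

For these samples the truncation method maps every value between -15 and 15 to zero, whereas bit masking maps the negative values in that range to -16 and the positive ones to zero. This is the asymmetry described in the text: truncation shrinks magnitudes on both sides of zero, while masking shifts all values downward.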

With the LPHC method, both the compression ratio and
the reconstruction accuracy depend on the number of levels
on the pyramid to which the image is decomposed and
reconstructed. We first investigated the dependence of the
detection accuracy of the computer algorithm on the pyramid
level. Figure 4 shows the FROC curves for two to four
Gaussian pyramid levels of decomposition and reconstruction.
The Laplacian images were quantized to eight bits in all
cases. The images decomposed into three levels and
reconstructed provided higher accuracy than those of two and
four levels. Applying a paired t test to the TP rates at
corresponding FP rates and pooled over the range of FPs
between about 0.1 and 1 FP per cm², as discussed previously
and in Sec. IV, we found that the TP rates for the three-level
decomposed images are significantly higher, with a two-tailed
p value of less than 0.001, than those for the two- or
four-level decomposed images.

The dependence of the detection accuracy on reconstruction
level was also examined. The images were decomposed
to three levels and the Laplacian pyramid images were
maintained at 12 bits. The images were then reconstructed to
the second level (image size of 256 × 256 pixels), to the first
level (image size of 512 × 512 pixels), and to the original
level (image size of 1024 × 1024 pixels). Because the first
and second level images were already low-pass filtered, the
5 × 5 kernel was not applied to the difference images in the
detection process in order to avoid further reduction in the
spatial resolution. The detection results were plotted in Fig.
5. The detection accuracy decreased drastically if the images
were not reconstructed to the original zeroth level. The
decreases in the TP rates are statistically significant from the
original level to the first level (p<0.01) and from the
original to the second level (p<0.001).

Based on these results, images decomposed to the third
level and reconstructed to the original level provided the
highest detection accuracy. The following studies of the
dependence of the detection accuracy on the bit depth of
quantization were performed under this condition. The
quantization of the Laplacian pyramid images was varied from
six bits to nine bits. The FROC curves for these quantization
bit depths were plotted in Fig. 6, along with the FROC curve
for the original 12-bit images. There are no statistically
significant differences among the curves for the 12-bit to
8-bit images (p>0.05). As the bit depth decreased further to
seven bits, the reduction in the TP rates from those of the
12-bit images in the range of FP between 0.1 and 1 per cm²
became statistically significant at p<0.003. The corresponding
average bit rate for each of the conditions is shown in the
figure legends. It indicates that, at eight-bit quantization of
the Laplacian images, the images can be compressed to an
average of 3.28 bits/pixel without loss of the detectability of
the microcalcifications by the computer algorithm.

With the DCT-FFEC approach, we evaluated three
conditions: splitting into an MSB part of 3 bits and an LSB
part of 9 bits, splitting into an MSB part of 4 bits and an LSB
part of 8 bits, and no splitting. The detection accuracy in the
reconstructed images for these conditions was compared in
Figs. 7(a)-7(c). The parameters used for the compression
schemes are listed in Table I in Appendix B. With splitting,
the detection accuracy for the microcalcifications was similar
to that in the original images at entropy coding ranges of
seven to two bits and six to two bits (p>0.05). The coding
range of six to two bits resulted in an average bit rate of
1.25 bits/pixel for the three- and nine-bit splitting and 1.82
for the four- and eight-bit splitting. The detectability
dropped significantly (p<0.001) as the range was reduced to
five to two bits for both splitting schemes. The error image
for the three- and nine-bit splitting with the entropy coding
range of six to two bits is shown in Fig. 3(d). Without
splitting, the detection accuracy for the entropy coding ranges
of eight to two bits and seven to two bits was comparable to
that of the original images (p>0.05). The average bit rate for
the seven to two bits coding was 1.49. The detectability was
significantly lower for the coding range of six to two bits
(p<0.001).

The MSE and the bit rate for each of the compression
schemes and conditions were averaged over the set of 25
images. The results were plotted on a semilog scale in Fig. 8.
Each of the solid curves shows the results for a DCT-FFEC
scheme with a different bit splitting parameter. The range of
bits for entropy coding was varied along each curve. The
results for the LPHC method were plotted as dashed curves.
The lowest average bit rate that a scheme could achieve at
which the detectability of the microcalcifications by
computer was comparable to that of the original images was
identified by an arrow for each of the compression schemes.
The logarithm of the MSE appeared to be inversely
proportional to the bit rate for most compression schemes. The
curves for the DCT-FFEC schemes without or with splitting
were comparable. The lowest average bit rate achieved by
the DCT-FFEC method without a degradation in the
detection accuracy is 1.25, corresponding to a compression
ratio of about 9.6:1 in comparison to a 12-bit image without
compression.

For the LPHC technique, the curve for each bit elimination
method was plotted in Fig. 8 for linear quantization from
six to nine bits. It can be seen that the round-off and the bit
masking methods yielded similar MSE and bit rates for
seven to nine bit quantization. For the truncation method, the
MSE was much higher than that of the other two methods for
a given number of quantization bits. This is consistent with
the visual appearance of the error image shown in Fig. 3(b).
The entropy of the Laplacian images from the truncation
method was lower than those from the other methods because
a larger number of pixel values were truncated to zero from
both the positive and the negative pixel values. Although it
appeared that the detectability of the eight-bit truncated
images did not decrease significantly compared to that of the
original images, the detectability dropped more rapidly than
with the other two methods when the Laplacian images were
further truncated to seven bits and six bits. Using the bit
masking or round-off method, the lowest bit rate achieved
without a degradation in the detectability of the
microcalcifications by the computer was about 3.3, which
corresponded to a compression ratio of 3.6:1 in comparison to
a 12-bit image without compression.

[FROC plot; horizontal axis: number of false-positive microcalcifications per cm²]

Fig. 4. The FROC curves for images decomposed to the Nth level (refer to
the flow diagram in Fig. 9) and reconstructed to the zeroth level; crosses:
N=2, triangles: N=3, and squares: N=4. The Laplacian pyramid images
were quantized to eight bits.

[FROC plot; horizontal axis: number of false-positive microcalcifications per cm²]

Fig. 5. The FROC curves for images decomposed to the N=3 level and
reconstructed to the zeroth level (crosses), the first level (triangles), and the
second level (squares). The Laplacian pyramid images were used at 12 bits
without requantization.

[FROC plot; horizontal axis: number of false-positive microcalcifications per cm²]

Fig. 6. The FROC curves for images decomposed to the N=3 level and
reconstructed to the zeroth level. The Laplacian pyramid images were
quantized to six to nine bits. The bit rate of 8.47 at 12 bits was calculated for a
lossless compression with the LPHC technique.

Fig. 7. The FROC curves for images compressed and decompressed with
the DCT-FFEC technique, (a) with splitting at MSB=3 and LSB=9, (b)
with splitting at MSB=4 and LSB=8, and (c) without splitting.

IV.  DISCUSSION 

The results of this study indicate that the DCT-FFEC
method can provide a higher compression ratio than the
LPHC method. The DCT-FFEC method with bit splitting of
three MSB and nine LSB and entropy coding of six to two
bits can achieve a compression ratio of 9.6:1 without
significant degradation in the detectability of
microcalcifications by a computer algorithm. On the other
hand, with the LPHC method and the range of parameters
studied, the compression ratio is only about 3.6:1 if the
detectability of subtle microcalcifications has to be preserved.
The DCT-FFEC method is thus about three times more efficient
than the LPHC method if the detectability of
microcalcifications by computer is used as the criterion of
image fidelity. Although computer vision can be very different
from human vision and the results cannot be simply
generalized to image compression to be used for human
readers, our results indicate that the DCT-FFEC method can
retain high-frequency information, such as that of the
microcalcifications, better than the LPHC method.
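The compression ratios quoted above follow directly from the average bit rates, taking the uncompressed image at 12 bits/pixel. The short sketch below reproduces the arithmetic; the function name is hypothetical.

```python
def compression_ratio(bit_rate, original_bits=12):
    """Ratio of the original bit depth to the compressed bits per pixel."""
    return original_bits / bit_rate


dct_ffec_ratio = compression_ratio(1.25)  # lowest DCT-FFEC bit rate
lphc_ratio = compression_ratio(3.3)       # lowest LPHC bit rate
relative_efficiency = dct_ffec_ratio / lphc_ratio
```

The DCT-FFEC ratio is 9.6 and the LPHC ratio rounds to 3.6. Their quotient is about 2.6, which is the basis for describing the DCT-FFEC method as roughly three times more efficient.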

Examples of the error images obtained from subtracting
the decompressed image from the original image for the
different compression schemes with the selected parameters
are shown in Figs. 3(a)-3(d). It can be seen that the error
image from the DCT-FFEC technique [Fig. 3(d)] appears to be
more random and to contain higher frequencies than those from
the LPHC technique. The error image from the LPHC
technique with truncation [Fig. 3(b)] contained visible
structures of the mammogram. This problem can be avoided by a
constant shift of all pixel values of the Laplacian pyramid
images to positive integers before division and truncation.
The resulting image and detectability of microcalcifications
will then be similar to those of the round-off and bit-masking
methods.


[Plot; horizontal axis: average bit rate (bits/pixel)]

Fig. 8. The relationship between the mean square error (MSE) and the
average bit rate for the various compression techniques. The solid curves are
for DCT-FFEC techniques and the dashed curves are for the LPHC
techniques.


It can be seen from Fig. 8 that, at the lowest average bit
rate without degradation of detectability by the computer, the
MSE from the different methods varied from about 36 to
274. However, the MSE did not correlate with the
detectability of the microcalcifications by the computer. For
example, at a higher MSE of 274, the DCT-FFEC method with
three- and nine-bit splitting and entropy coding of six to two
bits provided a higher detectability than the LPHC technique
with seven-bit quantization at an MSE of 154. The relative
image quality and the information content in the
decompressed images therefore cannot be judged by comparison
of the MSE. Experimental measurement of the detectability of
signals in the decompressed images has to be performed for
both human and machine observers in order to determine the
loss of image information. The acceptable degree of
information loss will also be dependent on the detection task.

For the purpose of this study, the detectability of
microcalcifications by computer on the decompressed images
was compared to that on the original images when the same
preprocessing method for SNR enhancement in the CAD
algorithm was used for each image compression method. For
example, for the DCT-FFEC approach, a bandpass filter used
in our previous studies was used to extract the difference
image. For the LPHC method, the Laplacian pyramid images
were used to reconstruct the difference image. It can be seen
by comparing the highest detection curves in Figs. 6 and 7
that the multiresolution Laplacian pyramid images can provide
a higher detectability than the bandpass filtered images.
Therefore, although the LPHC method is less efficient than the
DCT-FFEC method for image compression, the Laplacian
pyramid decomposition can be useful for SNR enhancement
in the CAD program. This was also our motivation for
choosing to evaluate the LPHC method for image compression
in CAD applications.

It may be noted that, in the LPHC method, we chose to
use linear quantization for compression of the Laplacian
pyramid images. It is possible that some other methods can
compress these high-frequency bandpass images more
efficiently and improve the performance of the LPHC method.
The ranges of parameters tested in the DCT-FFEC technique
were also somewhat limited. We could not explore
exhaustively the different image compression methods or all
possible combinations of parameters for a specific method in
this study. However, our investigation did indicate the utility
of the DCT-FFEC technique and the importance of proper
evaluation of information loss for mammographic image
compression. More extensive comparison of various
compression techniques is warranted in future studies.

One principal reason that digital mammograms require an
extremely high resolution is the potential appearance
of subtle microcalcifications, which may or may not be
associated with breast cancer. In general, a mammogram is
characterized as an image with predominantly low-frequency
contents, except for microcalcifications and subtle
spiculations and margin characteristics of masses, which may
indicate an early breast cancer. In other words, only a very
small portion of the mammogram contains clinically
significant image patterns. While these patterns can be greatly
distorted by an image compression technique, to our knowledge
no global error measurement can quantify the effects. Many
conventional compression techniques, therefore, can achieve
a large compression ratio and obtain a low MSE without
producing obvious visual degradation. However, these image
compression techniques may degrade clinically significant
information such as microcalcifications. In this study, our
machine observer indicated the potential loss of detectability
due to improper compression methods. The results of this
preliminary study emphasize the need for special attention
to the evaluation of image fidelity when compression
techniques are applied to radiological images with
potential subtle disease patterns. Optimization of image
compression techniques based on analysis of detailed image
features has recently been pursued by Lo et al.

As found in our previous study, the FROCFIT program
could not provide good fits to our FROC curves. Our attempt
to apply the alternative FROC (AFROC) analysis to the
detection data, and subsequently the CLABROC program to the
pair of correlated AFROC curves, also failed to obtain
reasonably fitted curves. This problem was probably caused by
the correlation of the individual FP signals detected in an
image due to the clustering criterion used in the detection
process. We therefore could not use a fitted FROC curve or a
single index such as the area under the AFROC curve for
comparison of the detection performance among different
conditions. Because a rigorous statistical test of the
significance of the differences between pairs of FROC curves
was not yet available, we applied a paired t test to the TP
values at a given FP to estimate the statistical significance
of the differences between the detection accuracy obtained
from each pair of conditions. The number of TP signals
detected under the first condition for an image at an SNR
threshold that yielded a given mean number of FP signals was
paired with the corresponding TP signals detected under the
second condition for the same image at an SNR threshold that
yielded a similar mean FP. The t test was performed for TP
pairs over a range of FP values of interest. The inclusion of
TP pairs over a range of FP values took advantage of the
consistency of the differences between the two FROC curves
over the range of interest, similar to a curve fitting
approach. However, the statistical significance might be
somewhat overestimated because of the potential correlation
between the TP pairs at the different SNR thresholds. An
alternative test may be a paired t test of the differences in
the partial areas under the FROC curves over the range of FP
values of interest for the corresponding image pairs. The
validity of these tests may be evaluated when a rigorous
statistical significance test for FROC curves is developed.
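
The pairing described above can be illustrated with a minimal sketch of the
paired t statistic. The function name and the TP counts below are
hypothetical and are not data from this study; the sketch assumes that each
list holds one TP count per image, obtained at SNR thresholds that give a
similar mean FP under the two conditions. It returns only the statistic and
the degrees of freedom, and leaves the lookup of the significance level to a
t table or a statistics package.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(tp_first, tp_second):
    """Paired t statistic for per-image TP counts under two conditions.

    Returns (t, degrees_of_freedom). The inputs must be paired image by
    image (and, if several FP values are pooled, operating point by
    operating point).
    """
    if len(tp_first) != len(tp_second):
        raise ValueError("the two conditions must be paired image by image")
    diffs = [a - b for a, b in zip(tp_first, tp_second)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Hypothetical TP counts for four images (illustration only).
uncompressed = [5, 6, 7, 8]
compressed = [4, 6, 5, 7]
t, dof = paired_t_statistic(uncompressed, compressed)
print(f"t = {t:.3f} with {dof} degrees of freedom")
```

Pooling TP pairs from several SNR thresholds into the same lists increases
the number of pairs but, as noted in the text, those pairs are not
independent, so the resulting significance may be overestimated.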


V.  CONCLUSION 

We evaluated two image compression methods in this
study. It was found that the DCT-FFEC method with bit
splitting is more efficient than the LPHC method with linear
quantization for compression of mammographic images
without degradation of the detectability of subtle
microcalcifications by our automated detection algorithm. The
highest compression ratio achieved without significant loss in
detection accuracy was 9.6:1. It was demonstrated that the MSE
was a poor indicator for comparison of information loss due
to image compression. The evaluation of the acceptability of
an image compression technique should therefore be based
on human and machine observer studies.


APPENDIX A: LAPLACIAN PYRAMID HIERARCHICAL
CODING (LPHC)

A schematic of the LPHC technique for image compression
and decompression is shown in Fig. 9. An input image
$G_0$ is low-pass filtered with a local and symmetric weight
kernel $w(m,n)$ and then subsampled by every other pixel to
different levels sequentially according to the following
relationship:

$$\mathrm{REDUCE}(G_{k-1}) = G_k(i,j) = \sum_{m=-2}^{2} \sum_{n=-2}^{2} w(m,n)\, G_{k-1}(2i+m,\, 2j+n), \tag{A1}$$

where $k$ ($1 \le k \le N$) is an index of compression level, $N$ is
the number of levels in the pyramid, and $(i,j)$ is the pixel
location in the image. The matrix size of an image at the $k$th
level of the pyramid, $G_k$, is reduced by a factor of 4
compared with the $(k-1)$th level image, and is referred to as a
“reduced” version of $G_{k-1}$. The reduced image is then
expanded with a similar operation:

$$\mathrm{EXPAND}(G_k) = E_{k-1}(i,j) = 4 \sum_{m=-2}^{2} \sum_{n=-2}^{2} w(m,n)\, G_k\!\left(\frac{i-m}{2},\, \frac{j-n}{2}\right), \tag{A2}$$

where the summation is performed over the terms for which
$(i-m)/2$ and $(j-n)/2$ are integers. An error image in level
$k-1$ is given by the difference between $G_{k-1}$ and $E_{k-1}$:

$$L_{k-1} = G_{k-1} - \mathrm{EXPAND}(G_k). \tag{A3}$$

By performing the reduction $N$ times, as shown in Fig. 9,
a sequence of $N$ low-pass filtered images with successively
reduced spatial resolution and reduced sampling rate is
obtained. Since one of the important convolution weight
kernels resembles the Gaussian probability distribution, this
sequence of low-pass filtered images is referred to as the
Gaussian pyramid.

Fig. 9. Schematic diagram of the Laplacian pyramid hierarchical coding
(LPHC) and decoding technique.

By expanding the reduced image and calculating the error
image at each level, a sequence of $N$ subband images of
reduced sampling rate is also obtained. This sequence of
error images is composed of bandpass filtered images and is
referred to as the Laplacian pyramid. The scale of the
Laplacian operator doubles from level to level of the pyramid,
while the center frequency of the passband is reduced by an
octave.

The original image can be reconstructed using the highest
level Gaussian pyramid image, $G_N$, and the sequence of
Laplacian pyramid images, $L_{k-1}$, $k = 1, \ldots, N$, as shown
on the right-hand side of Fig. 9:

$$G_{k-1} = L_{k-1} + \mathrm{EXPAND}(G_k). \tag{A4}$$

If no lossy compression has been applied to $G_N$ and the
Laplacian pyramid images, the original image can be
recovered without loss.
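
The decomposition and reconstruction steps of Eqs. (A1) to (A4) can be
sketched in a few lines of plain Python. This is an illustrative sketch, not
the code used in the study: the function names are hypothetical, images are
lists of rows, the separable weights use a = 0.4 as chosen in this appendix,
and pixels outside the image border are handled by clamping to the nearest
edge pixel, which is an assumption because the border treatment is not
specified here.

```python
def kernel_1d(a=0.4):
    """Separable weights h(-2..2): h(0)=a, h(+-1)=1/4, h(+-2)=1/4 - a/2."""
    return [0.25 - a / 2, 0.25, a, 0.25, 0.25 - a / 2]


def _clamp(v, hi):
    # Assumed border rule: reuse the nearest edge pixel.
    return min(max(v, 0), hi)


def reduce_level(g, h):
    """REDUCE, Eq. (A1): low-pass filter, then keep every other pixel."""
    rows, cols = len(g), len(g[0])
    out = []
    for i in range((rows + 1) // 2):
        row = []
        for j in range((cols + 1) // 2):
            s = 0.0
            for m in range(-2, 3):
                r = _clamp(2 * i + m, rows - 1)
                for n in range(-2, 3):
                    c = _clamp(2 * j + n, cols - 1)
                    s += h[m + 2] * h[n + 2] * g[r][c]
            row.append(s)
        out.append(row)
    return out


def expand_level(g, rows, cols, h):
    """EXPAND, Eq. (A2): interpolate back to a rows x cols grid."""
    out = []
    for i in range(rows):
        row = []
        for j in range(cols):
            s = 0.0
            for m in range(-2, 3):
                if (i - m) % 2:
                    continue  # keep only terms with integer (i - m) / 2
                r = _clamp((i - m) // 2, len(g) - 1)
                for n in range(-2, 3):
                    if (j - n) % 2:
                        continue  # keep only terms with integer (j - n) / 2
                    c = _clamp((j - n) // 2, len(g[0]) - 1)
                    s += h[m + 2] * h[n + 2] * g[r][c]
            row.append(4.0 * s)
        out.append(row)
    return out


def build_pyramid(g0, levels, h):
    """Return ([L_0, ..., L_{N-1}], G_N) using Eq. (A3)."""
    laplacian, g = [], g0
    for _ in range(levels):
        g_next = reduce_level(g, h)
        e = expand_level(g_next, len(g), len(g[0]), h)
        laplacian.append([[p - q for p, q in zip(gr, er)] for gr, er in zip(g, e)])
        g = g_next
    return laplacian, g


def reconstruct(laplacian, g_top, h):
    """Invert the pyramid with Eq. (A4), from the top level down."""
    g = g_top
    for lap in reversed(laplacian):
        e = expand_level(g, len(lap), len(lap[0]), h)
        g = [[p + q for p, q in zip(lr, er)] for lr, er in zip(lap, e)]
    return g


h = kernel_1d(0.4)
image = [[(3 * x + 5 * y) % 17 for x in range(8)] for y in range(8)]  # toy data
laplacian, g_top = build_pyramid(image, 2, h)
restored = reconstruct(laplacian, g_top, h)
max_err = max(abs(p - q) for a, b in zip(image, restored) for p, q in zip(a, b))
print("top level size:", len(g_top), "x", len(g_top[0]), "max error:", max_err)
```

Because each Laplacian level stores exactly what EXPAND fails to predict,
the reconstruction is exact up to floating-point rounding as long as no
level is quantized, whatever border rule is used.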

In this study, we decomposed the top level Gaussian
pyramid image by the differential pulse code modulation
(DPCM) technique. The number of bits required for encoding
the DPCM decomposed image was then determined by the
entropy of its pixel value distribution, i.e., the histogram
of its gray levels. For the Laplacian pyramid images, because
their pixel values were decorrelated and could be assumed to
be statistically independent, the minimum number of bits per
pixel required to exactly encode the image was given by its
entropy. This optimum might be approached in practice
through techniques such as variable-length encoding. The
entropy of the pixel value distribution of an image was
given by

$$\mathrm{Entropy} = -\sum_{i=0}^{4095} f(i) \log_2 f(i), \tag{A5}$$

where $f(i)$ was the observed probability of occurrence of
gray level $i$. Assuming that variable-length code words
were used in data transmission to take advantage of the
nonuniform distribution of pixel values, the effective number
of bits for a given Laplacian pyramid level was its entropy
times its matrix size. The effective number of bits per pixel
for the encoded image was thus the sum of the number of
bits for all levels of the component images divided by the
matrix size of the original image.
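
As a minimal sketch of this bookkeeping, the entropy of Eq. (A5) and the
effective number of bits per pixel can be computed as follows. The function
names and the toy pixel lists are hypothetical; each component image is
assumed to be given as a flat list of integer gray levels, and gray levels
that never occur are taken to contribute nothing to the sum.

```python
import math
from collections import Counter


def entropy_bits(pixels):
    """Eq. (A5): entropy in bits per pixel of a flat list of gray levels."""
    counts = Counter(pixels)
    total = len(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def effective_bits_per_pixel(component_images, original_size):
    """Sum of (entropy x matrix size) over all components / original size."""
    total_bits = sum(entropy_bits(img) * len(img) for img in component_images)
    return total_bits / original_size


# Toy components (illustration only): a 4-pixel level with four equally
# likely values and a 2-pixel top level with a single value.
components = [[0, 1, 2, 3], [7, 7]]
print(entropy_bits(components[0]))              # 2 bits per pixel
print(effective_bits_per_pixel(components, 4))  # relative to a 4-pixel original
```

The compression ratio then follows as the bit depth of the original image
divided by the effective number of bits per pixel.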

Following the approach described by Burt and Adelson,
a $5 \times 5$ kernel of weights $w(m,n)$ that was separable into
$w(m,n) = h(m)h(n)$ was used in this study. Here $h$ was a
symmetric function, such that $h(i) = h(-i)$ for $i = 0, 1, 2$.
The weights were subject to the constraint that all nodes at a
given level contributed the same total weight to nodes at the
next higher level. Therefore, $h(0) = a$, $h(-1) = h(1) = 1/4$,
and $h(-2) = h(2) = 1/4 - a/2$. The constant $a$ was chosen to
be 0.4 in this study to obtain a Gaussian-like function. A
uniform quantization by eliminating the least significant bits
was applied to the Laplacian pyramid images. Although these
parameters might not be optimal choices for mammographic
images, they were selected as typical values for study of the
effects of the LPHC method on mammograms.


APPENDIX B: DISCRETE COSINE TRANSFORM-
FULL FRAME ENTROPY CODING (DCT-FFEC)

A schematic diagram of the FFEC technique with splitting
and remapping is shown in Fig. 10. An input image of
$(n+k)$ bits is split into $n$ most significant bits (MSB) and
$k$ least significant bits (LSB). Because the MSB in medical
images are highly correlated, they can be encoded by
correlation encoding such as Lempel-Ziv (LZ) coding or
run-length/Huffman coding with a high compression ratio. For
the LSB image, the bits are remapped in order to convert the
residual data into an image with a more continuous tone. The
remapping of the LSB, denoted as RLSB, for an image with
gray levels $g(x,y)$ can be expressed as

$$\mathrm{RLSB}_k(g(x,y)) = \mathrm{LSB}_k(g(x,y)), \quad \text{for } [g(x,y)\ \&\ 2^k] = 0, \tag{B1}$$

$$\mathrm{RLSB}_k(g(x,y)) = 2^k - 1 - \mathrm{LSB}_k(g(x,y)), \quad \text{for } [g(x,y)\ \&\ 2^k] \neq 0, \tag{B2}$$

where & is the bitwise “AND” operation on the bit map of the
integers. The split and remapped image, $\mathrm{RLSB}_k(g(x,y))$,
is then subjected to a two-dimensional DCT. The spatial
frequency domain image is divided into three zones for linear
quantization. The zone boundaries and the number of bits
used in each zone are tabulated in Table I. In the
low-frequency zone, the DCT coefficients are stored and
transmitted as the original floating point values so that no
information is lost. In the mid-frequency zone, the
coefficients are quantized to 12 bit integers. In the
high-frequency zone, the coefficients are quantized to a range
of bits specified by an input maximum and minimum number. The
compression ratio is large when the maximum number of bits
allowed is small. A specific number of bits to be used for a
given coefficient in this zone is determined by an energy
allocation scheme. The ranges of seven to two bits, six to two
bits, and five to two bits were compared for the compression
schemes with splitting in this study. The quantized
coefficients were then submitted to a statistical coding
routine for data packing. The standard statistical coding
schemes included arithmetic and Huffman coding, of which the
former was used in this study.

Table I. The parameters used in the discrete cosine transform-full frame
entropy coding technique.

Zone    Frequency region    No. of bits for coding
1       0-63                Floating point
2       64-127              12
3       128-1023            Maximum    Minimum
                            8          2
                            7          2
                            6          2
                            5          2

Fig. 10. Schematic diagram of the discrete cosine transform-full frame
entropy coding (DCT-FFEC) technique with image splitting and remapping.
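
The bit splitting and remapping of Eqs. (B1) and (B2) can be sketched for a
single pixel value as follows. The function names and the choice k = 2 in
the example are hypothetical. The inverse function is an assumption added
for illustration: it relies on the fact that the bit tested in Eqs. (B1) and
(B2), bit k of g, is the lowest bit of the MSB part and is therefore still
available to the decoder.

```python
def split_and_remap(g, k):
    """Split g into its MSB part and the remapped k-bit LSB part (RLSB)."""
    msb = g >> k                  # upper n bits
    lsb = g & ((1 << k) - 1)      # lower k bits
    if g & (1 << k):              # Eq. (B2): bit k set, reverse the ramp
        rlsb = (1 << k) - 1 - lsb
    else:                         # Eq. (B1): bit k clear, keep the LSB as is
        rlsb = lsb
    return msb, rlsb


def merge(msb, rlsb, k):
    """Recover g from the MSB part and the remapped LSB part."""
    if msb & 1:                   # bit k of g is the lowest bit of msb
        lsb = (1 << k) - 1 - rlsb
    else:
        lsb = rlsb
    return (msb << k) | lsb


# With k = 2 the plain LSB of 0..11 is a sawtooth 0,1,2,3,0,1,2,3,...;
# the remapped LSB is a triangle wave without the jumps from 3 back to 0.
remapped = [split_and_remap(g, 2)[1] for g in range(12)]
print(remapped)
```

Removing the sawtooth discontinuities is what gives the residual image the
more continuous tone mentioned in the text, which suits a transform coder
better than abrupt wraparounds.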

For FFEC without bit splitting and remapping, the
procedure is similar to that described above. The only
difference is that there is no MSB image to be encoded. The
entire image undergoes DCT, zonal quantization, and arithmetic
coding, as illustrated in the lower path of Fig. 10. For energy
allocation, the ranges from eight to two bits through six to
two bits were evaluated.

To decompress the FFEC image, the reverse operation of
compression is employed. The quantized coefficients are
decoded, followed by reverse quantization, and then by the
inverse DCT. Because of the quantization process that
reduces the DCT coefficients of real numbers to integers with
a finite number of bits, the reverse quantization cannot
recover the original image information in its entirety. The
degree of information loss with the FFEC technique depends
on the information content of the input images and the
compression ratio.
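
The loss introduced by quantization and reverse quantization can be
illustrated with a minimal sketch of a uniform (linear) quantizer. The
symmetric rounding scheme, the function names, and the assumed known maximum
coefficient magnitude are all assumptions made for illustration; the energy
allocation scheme that assigns a bit count to each coefficient is not
reproduced here.

```python
def quantize(coeff, max_abs, bits):
    """Uniformly quantize a real coefficient to a signed integer of `bits` bits.

    Returns (q, step), where step is the quantization interval.
    """
    levels = (1 << (bits - 1)) - 1        # largest integer magnitude
    step = max_abs / levels
    q = round(coeff / step)
    return max(-levels, min(levels, q)), step


def dequantize(q, step):
    """Reverse quantization: map the integer back to a real value."""
    return q * step


coeff, max_abs = 3.7, 10.0                # hypothetical DCT coefficient
for bits in (12, 8, 5, 2):
    q, step = quantize(coeff, max_abs, bits)
    error = abs(dequantize(q, step) - coeff)
    print(f"{bits:2d} bits: q = {q:5d}, error = {error:.4f}")
```

Within the representable range the error is at most half a quantization
step, so it grows as fewer bits are allotted to a coefficient.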